Fixes performance problem caused by deep hierarchies, ResolveAll, and unusable constructors #158
base: master
Conversation
It would be interesting to add a commit before this one with new performance tests to illustrate the issue, then run again after the change to show the difference. Is that feasible?
Sure thing! I won’t be able to practically create a hierarchy as large as a production one, but I should be able to make one large enough to get a reasonably interesting test. I’ll start on this first thing tomorrow.
I've added a test using a deep (though quite ugly) hierarchy. Locally, I had to change the project to run on .NET 4.6.1 due to missing SDK dependencies for 4.5, but I did not check that change in (in case you are curious why the table shows .NET 4.6.1 instead of 4.5).

Benchmark environment: BenchmarkDotNet=v0.12.1, OS=Windows 10.0.20348, Jit=RyuJit, Platform=X64
@gkarabin: I've attached the benchmark, as requested. Is there anything else that I need to do?
I can’t speak for the maintainers, like @niemyjski or @grumpydev - they would have to accept/merge. I want to run the benchmarks myself and look for regressions. I also want to spend some time digging into the code and understanding it. Unfortunately, with the holidays and deadlines for my job I haven’t made time for it. It’s on my to-do list, but I can’t make a firm commitment.
That makes complete sense. I'm just a little over eager. I'll tone it down a little. :)
@JoshuaRogers thanks for the PR, I'm in the same boat as @gkarabin when it comes to time. Really appreciate the effort you have put into this. At first, I was a bit concerned seeing all the new instances of ResolutionCache being created but then read your thread. Some questions that I had:
@gkarabin @grumpydev Would love any feedback that you have. It does look OK from the files view, but I haven't run the code locally.
Looks like the build failed for this, we may need to update the build. I'm fine with cutting unsupported build targets that are EOL. |
I finally had a chance to take a closer look at the benchmark today. I wanted to see how the existing benchmarks behave after your change. I noticed that, at least in your PR, you hadn't updated TinyIoCOriginal.cs (which you need to do, or your benchmarks would be comparing against outdated baseline code).

I bundled up the results of 3 runs: one from the main branch of the grumpydev repo, one from the main branch of your repo, and one from the main branch of your repo with TinyIoCOriginal.cs updated to the contents of the grumpydev main branch. I only used .NET Core 3.1, and did not use .NET Framework (which is how your tests were done). I suspect that it does not matter, though.

Your new tests are an order of magnitude faster when run using your PR compared to the updated TinyIoCOriginal.cs from the grumpydev main branch. The other tests perform worse, but within the same order of magnitude. So I guess the question is: is the performance improvement on the new tests worth the performance drop on the old tests? That's a hard question to answer. Which test set is more like typical workloads? I don't have a good answer, at least not yet. I wonder what others think - @niemyjski, @grumpydev, @Yortw - thoughts?
These results are very concerning. I use TinyIoC in game development and my projects don't have deep hierarchies. The Resolve Instance Without Dependencies result is the most concerning. Due to the high frequency this is called in my projects, it would most likely impact frame rate. Selfishly, I hope this change does not get merged. One shoe does not fit all. Perhaps this could be made a feature that can be turned on for projects that could benefit?
That makes sense. For us, this is used in our web server with DeepHierarchy being the most common case, making TinyIoC the source of most of our time. I believe that most of the degradation could be avoided by initializing the objects I added all up front. The only reason I didn’t was that it removed my ability to add caching for named resolution (a feature that I don’t actually use myself). If we were okay with named resolution not seeing any increase in speed for DeepHierarchies, I could make that change instead.
Hi all, I'm coming to this late and haven't actually looked at the code yet (limited time, sorry). I have read the thread here and looked at the benchmarks posted, which all seems like good information.

I think my use case sits between that of @JoshuaRogers and @zaddo67 - I use TinyIoC mostly in native client apps (Xamarin Forms, WinForms, sometimes console apps, etc.). For the Xamarin apps I have some that run on very underpowered ARM devices where hardware choice is limited and time between screens is a major UX factor, so milliseconds do matter. I tend not to use IoC for classes that have no dependencies, but I have a lot of singletons (or effectively singletons within a scope, though my scopes probably last a lot longer than a typical HTTP request). So I'm concerned about the degradation there, and I share @zaddo67's concern about the resolution of items with no dependencies anyway; I suspect (with no evidence) that's pretty common and not a case we'd like to degrade. On that basis I'm worried about merging the code here as is.

That said, the web case is definitely a common one too, and the improvement in performance for the deep hierarchy handling is significant, so I think it's worth solving. I also just think the caching idea is a good one. I thought I'd done some of this in my own PR for performance improvements (possibly just at a different level?). For my 2 cents' worth, I would like to re-evaluate this statement from earlier in the thread:
I stand to be proven wrong but I think that if the cache persisted across multiple calls then probably all the calls would be faster (at least in the benchmark, but probably most of them in the real world too), with the noted exception of the named arguments. We'd also potentially eliminate/reduce the issues raised by @niemyjski around the dictionary resizing/lifetime etc. Given the comments above, the easy path would seem to be to cache everything that isn't named (or ignore the cache when a named parameter is present, I'm not sure which since I haven't dived into the code yet). Obviously this leaves the original problem of the named arguments not benefitting, but:
So my vote would be to change the code to use a shared cache (per container, probably), ignoring it for the named resolution cases, and repeat the benchmarks to see what the outcome is. If the old benchmarks are the same or better, and the new ones are also good, I think that's a win despite the named arguments issue. It's possible that, if that happens, even my use case would gain some performance benefit, whereas with the "cache per call" as it is now it seems to just be adding overhead.
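As a sketch of this proposal (hypothetical names, a language-agnostic model rather than TinyIoC's actual API): one cache per container, consulted only when no resolution name is supplied, so named resolutions always recompute.

```python
# Hypothetical model of a per-container resolvability cache that is
# bypassed whenever a named resolution is requested.
class Container:
    def __init__(self, deps):
        self._deps = deps    # type name -> list of dependency type names
        self._cache = {}     # shared across Resolve calls on this container

    def can_resolve(self, t, name=None):
        # Named resolutions skip the cache entirely, so caching them
        # (and the associated lifetime questions) never arises.
        return self._check(t, use_cache=(name is None))

    def _check(self, t, use_cache):
        if use_cache and t in self._cache:
            return self._cache[t]
        result = t in self._deps and all(
            self._check(d, use_cache) for d in self._deps[t])
        if use_cache:
            self._cache[t] = result
        return result

c = Container({"Foo": ["Bar"], "Bar": []})
print(c.can_resolve("Foo"))  # True, and the result is now cached for later calls
```

The trade-off matches the discussion above: unnamed resolutions get faster on every call after the first, while named resolutions behave exactly as before.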
Also, really exciting to see the benchmark stuff I submitted being used to analyse and potentially prevent regressions. Thanks everyone for the effort being put in here.
That works for me: I don’t use named arguments either. I was originally including them for completeness' sake (though at the cost of complication). I’ll go update my PR to do one cache per container and to bypass it when using named arguments. As for not replacing TinyIoCOriginal.cs, I misunderstood “original” in that context to mean that I shouldn’t touch it (as then it would no longer be original, but instead modified). I didn’t realize the intent was to copy the current TinyIoC.cs over it before the test. :p I’ll make sure to update it when running the next set of benchmarks.
I'm not overly worried about the named resolution use case. I do think that named resolution is a pretty common case - I don't think it would have made the short list of TinyIoC features if it wasn't. My team uses it in our applications, but not in performance sensitive contexts - it tends to be at the beginning of a long-duration scope for identifying a subset of available implementations to use for a particular mode of operation, and not for a lot of resolutions at that. If it doesn't get much slower, especially compared to the state of things before @Yortw's improvements, I don't see a problem.
I really appreciate the collaboration and help around this. I'm not a real power user of this library, just here to try and help maintain it/provide feedback where I have experience. I mainly use it for simple singletons in the https://github.com/exceptionless/Exceptionless.Net library. I think we should try to avoid hurting the performance of the simplest of use cases that probably will have the most library usage (we have no stats for this without analyzing GitHub and your comments). |
Hi, sorry for the wall of text. I've tried to break this into a few sections and for each one show how the performance problem manifests. I felt that it was necessary to understand the reasoning for the approach at the end. If you prefer to skip to the end, the TL;DR is that "CanResolve, Resolve, and ResolveAll recalculate the same values repeatedly for some structures, causing multiple second delays."
When resolving an object from TinyIoC, before resolution, it first needs to verify that its dependencies can be resolved. The way it handles this is to reflect a list of all constructors from the dependency and to go through them, checking whether all parameters of each constructor can themselves be constructed. This continues recursively until a satisfiable constructor is found for each dependency and for the original object. After determining that an object can be resolved, TinyIoC then begins construction, a process that requires performing the previous calculation again to determine the correct constructor, as no data is cached.
On its own this is unnoticeable, but it is a small instance of a problem that grows at scale in the presence of large dependency trees, ResolveAll, or constructors that cannot be successfully resolved. In all three cases, the performance problem is the result of repeated recalculation of constructability for the same types.
**Large Dependency Tree**
Consider the below as an example of a small "large dependency tree".
```
class Foo { public Foo(Bar bar, Baz baz, Qux qux) { } }
class Bar { public Bar(Baz baz, Qux qux) { } }
class Baz { public Baz(Qux qux) { } }
class Qux { }
```
Due to the lack of caching, a request to resolve Foo leads to the following:
- Check if Foo can be resolved
  - Check if Bar can be resolved
    - Check if Baz can be resolved
      - Check if Qux can be resolved
    - Check if Qux can be resolved
  - Check if Baz can be resolved
    - Check if Qux can be resolved
  - Check if Qux can be resolved
This totals 1 check for Bar, 2 for Baz, and 4 for Qux, for a total of 7 eligibility tests. As the same logic must be performed again during the resolution stage, after we are certain that a constructor chain exists, the effective number of tests is doubled, coming in at 14.
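The uncached walk described above can be simulated with a short, language-agnostic model (the names mirror the example classes, but this is a sketch, not TinyIoC's actual code):

```python
from collections import Counter

# Toy dependency graph mirroring Foo/Bar/Baz/Qux above: each type maps to
# the parameter types of its only constructor.
DEPS = {"Foo": ["Bar", "Baz", "Qux"], "Bar": ["Baz", "Qux"], "Baz": ["Qux"], "Qux": []}

def can_resolve(t, checks):
    checks[t] += 1  # one eligibility check per visit, nothing is remembered
    return all(can_resolve(d, checks) for d in DEPS[t])

checks = Counter()
can_resolve("Foo", checks)
print(dict(checks))  # {'Foo': 1, 'Bar': 1, 'Baz': 2, 'Qux': 4}
```

Every type is re-checked once per appearance in the tree, which is exactly the duplication the walk above shows.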
**Multiple Resolutions**
Another possibility is that we have a number of event handlers all of which need to be resolved and which share dependencies, such as one might expect of a database or logger.
```
public class FooEventHandler1 : IFooEventHandler { public FooEventHandler1(Foo foo) { } }
public class FooEventHandler2 : IFooEventHandler { public FooEventHandler2(Foo foo) { } }
public class FooEventHandler3 : IFooEventHandler { public FooEventHandler3(Foo foo) { } }
public class FooEventHandler4 : IFooEventHandler { public FooEventHandler4(Foo foo) { } }
```
Assuming the above classes are all registered with TinyIoC as IFooEventHandler, a call to `ResolveAll<IFooEventHandler>()` would check that the common dependency Foo can be resolved on each handler resolution, for a total of four checks of the same type. If, continuing with the class definitions from the previous example, we remember that each check of Foo causes 7 tests to occur, followed by another 7 during the actual construction of the object, then we could expect this call to ResolveAll to cause 7 * (4 + 1) tests, for a total of 35.
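Extending the same toy model (assumed names, not TinyIoC's code) to a ResolveAll over the four handlers shows the multiplication directly:

```python
from collections import Counter

# Four handlers that all depend on Foo; without caching, Foo's whole chain
# is re-verified for every single handler resolution.
DEPS = {"Foo": ["Bar", "Baz", "Qux"], "Bar": ["Baz", "Qux"], "Baz": ["Qux"],
        "Qux": [], "Handler1": ["Foo"], "Handler2": ["Foo"],
        "Handler3": ["Foo"], "Handler4": ["Foo"]}

def can_resolve(t, checks):
    checks[t] += 1
    return all(can_resolve(d, checks) for d in DEPS[t])

checks = Counter()
for handler in ("Handler1", "Handler2", "Handler3", "Handler4"):  # ResolveAll
    can_resolve(handler, checks)
print(checks["Foo"], checks["Qux"])  # prints: 4 16 (Foo's subtree is walked four times over)
```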
**Unusable Constructors**
There's still one more case that leads to test multiplication: when the most preferable constructor (the one requiring the most arguments) has a dependency that cannot be resolved via IoC, such as when it is produced by a factory or a deserializer.
```
class Foo {
    public Foo(Bar bar, Baz baz, Qux qux, Bang bang) { }
    public Foo(Bar bar, Baz baz, Qux qux) { }
}
```
If we update our original Foo class to have a constructor that requires an unresolvable type "Bang", the effect is that we manage to double the tests performed: TinyIoC will test Bar, Baz, and Qux successfully before finding that Bang cannot be constructed. Knowing that this constructor cannot be satisfied, TinyIoC will then try the next constructor in line, attempting a second time to resolve Bar, Baz, and Qux, not remembering that we've already found them to be resolvable. The effect here is (4 + 1) * (7 + 7 + 1): each of the 4 IFooEventHandlers requires the chain to be tested, followed by one final pass during construction to find the appropriate constructor. The result, given these simplistic classes and their hierarchy, is 75 tests even though there are only 10 constructors defined between the 9 classes.
This last case is the easiest to handle, as the addition of an attribute on the constructor can tell TinyIoC which constructor to use, but that only works for first-party code and in many cases gives the constructed objects an awareness of TinyIoC that they did not previously have.
**Summary**
Given these three very real and very common patterns, the effective number of checks can be in the millions in a larger codebase, taking 20-30 seconds per request, depending on the number of handlers needing construction.
**Resolution**
The resolution for this issue is rather straightforward: after testing a Type or constructor, the result should be stored so that it does not have to be recalculated.
My first approach was to store the value within TinyIoC so that separate calls to Resolve or ResolveAll would be able to use a common cache. This turned out to be impractical, though, since calls to these methods can specify extra parameters that affect resolution, such as named resolution. Using a global cache would require either that named resolutions not be cached or that the logic be complicated to cache them while preventing a memory leak.
Given that limitation, I took the approach of creating a cache that gets passed around internally over the lifetime of the Resolve call and gets populated as tests occur. This means that repeated calls to Resolve will perform the same checks, but each check will only be performed a single time within a Resolve, changing what could be quadratic growth in time into linear time.
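As a sketch of the idea (assumed names, a Python model rather than the actual C# change): the same eligibility walk, but with a cache scoped to a single Resolve call, checks each type exactly once.

```python
# Same Foo/Bar/Baz/Qux graph as before; the cache lives only as long as
# one Resolve call, so named-resolution parameters never leak between calls.
DEPS = {"Foo": ["Bar", "Baz", "Qux"], "Bar": ["Baz", "Qux"], "Baz": ["Qux"], "Qux": []}

def can_resolve(t, cache, checks):
    if t in cache:  # already decided during this Resolve call
        return cache[t]
    checks[t] = checks.get(t, 0) + 1
    cache[t] = all(can_resolve(d, cache, checks) for d in DEPS[t])
    return cache[t]

checks = {}
can_resolve("Foo", cache={}, checks=checks)  # a fresh cache per Resolve call
print(checks)  # {'Foo': 1, 'Bar': 1, 'Baz': 1, 'Qux': 1}
```

Each type is now checked once per Resolve regardless of how many times it appears in the tree, which is what turns the quadratic blow-up into linear work.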
I verified that this fixed the issue that I was seeing and I also ran the unit tests to make sure that this did not introduce a regression.