Wise Disk Cleaner 10.26 - Crack Key For U

January 12, 2022 / Rating: 4.6 / Views: 733


Wise Disk Cleaner - Free Disk Cleanup and Defrag Tool

Wise Disk Cleaner will automatically clean up the useless files in the background when the scheduled time is up. In settings, you can also create a ‘Clean with 1-click’ icon and place it on the desktop. With this feature, you can clean up the junk files by just clicking the icon without opening Wise Disk Cleaner.

As .NET Core 2.0 was being released, I wrote Performance Improvements in .NET Core to highlight the quantity and quality of performance improvements finding their way into .NET. With its very positive reception, I did so again a year later with Performance Improvements in .NET Core 3.0, followed by Performance Improvements in .NET 5. That made this August a little daunting, as I sat down to write this post and was faced with the list I’d curated of almost 550 PRs. I at least peruse if not review in depth the vast majority of all those PRs, and every time I see a PR that is likely to impact performance, I make a note of it in a running log, giving me a long list of improvements I can revisit when it’s blog time. Don’t worry, I don’t cover all of them here, but grab a large mug of your favorite hot beverage, and settle in: this post takes a rip-roarin’ tour through ~400 PRs that, all together, significantly improve .NET performance.

That yielded a net6perf.csproj, which I then overwrote with the following contents; most importantly, this includes multiple target frameworks so that I can use BenchmarkDotNet to easily compare performance on them.

Such “sync-over-async” code, which often manifests as calling an async method and then blocking waiting on the returned task, is invariably considered a no-no in production code meant to be scalable, but sometimes it’s unavoidable. Such a situation might emerge when a system is flooded with “sync over async” work, a term coined to mean kicking off asynchronous work and then synchronously blocking waiting for it to complete; in the common case, such an anti-pattern ends up blocking one thread pool thread that depends on another thread pool thread doing work in order to unblock the first, and that can quickly result in all thread pool threads being blocked until enough have been injected to enable everyone to make forward progress.
Since its main branch forked a year or so ago to be for .NET 6, there have been over 6500 merged PRs (pull requests) into the branch for the release, and that’s excluding automated PRs from bots that do things like flow dependency version updates between repos (not to discount the bots’ contributions; after all, they’ve actually received interview offers by email from recruiters who just possibly weren’t being particularly discerning with their candidate pool).

We can write a functional, trivial thread pool, well, trivially: Boom, functional thread pool. Too few threads, and you can grind a system to a halt, as work items aren’t getting processed fast enough or, worse, running work items are blocked waiting for other work items to run but without enough additional threads to run them. .NET has multiple mechanisms in place for determining how many threads should be in play at any point in time. One is starvation detection, a fairly straightforward gate that kicks in once or twice a second and checks to see whether any progress has been made on removing items from the pool’s queues: if progress hasn’t been made, meaning nothing has been dequeued, the pool assumes the system is starved and injects an additional thread. Second, it has a hill climbing algorithm that is constantly seeking to maximize work item throughput by manipulating available thread count; after every N work item completions, it evaluates whether adding or removing a thread to/from circulation helps or hurts work item throughput, thereby making it adaptive to the current needs of the system. However, the hill climbing mechanism has a weakness: in order to properly do its job, work items need to be completing… if work items aren’t completing because, say, all of the threads in the pool are blocked, hill climbing becomes temporarily useless, and the only mechanism for injecting additional threads is the starvation mechanism, which is (by design) fairly slow.
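The “trivial thread pool” mentioned above can be sketched roughly like this. This is an illustrative reconstruction, not the post’s actual code and not the real `ThreadPool` implementation; the type and member names are hypothetical:

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading;

// Minimal sketch of a trivial thread pool: a blocking queue of work items
// serviced by a fixed set of dedicated threads. It has none of the real
// pool's thread management (no starvation detection, no hill climbing).
public sealed class TrivialThreadPool : IDisposable
{
    private readonly BlockingCollection<Action> _work = new();

    public TrivialThreadPool(int threadCount)
    {
        for (int i = 0; i < threadCount; i++)
        {
            new Thread(() =>
            {
                // Each thread just loops, dequeuing and running work items.
                foreach (Action action in _work.GetConsumingEnumerable())
                    action();
            })
            { IsBackground = true }.Start();
        }
    }

    public void QueueUserWorkItem(Action action) => _work.Add(action);

    public void Dispose() => _work.CompleteAdding();
}
```

The hard part the post goes on to describe, deciding how many threads should exist at any moment, is exactly what this sketch punts on by taking a fixed `threadCount`.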
Sometimes performance optimizations are about eliminating unnecessary work, or making tradeoffs that optimize for the common case while slightly pessimizing niche cases, or taking advantage of new lower-level capabilities to do something faster, or any number of other things. But sometimes, performance optimizations are about finding ways to help bad-but-common code be a little less bad.

At its core, a thread pool needs two things: a queue of work to be processed, and a set of threads to process them. The hardest part of a good thread pool is in the management of the threads, and in particular determining at any given point how many threads should be servicing the queue of work. Too many threads, and you can grind a system to a halt, as all threads are fighting for the system’s resources, adding huge overheads with context switching, and getting in each other’s way with cache thrashing.

“Wouldn’t it be nice if we could use those free spaces for other generations if they need them! Same with free spaces in gen2 and LOH – you might have some free spaces in gen2, it would be nice to use them to allocate some large objects. We do decommit on a segment but only the end of the segment which is after the very last live object on that segment (denoted by the light gray space at the end of each segment). And if you have pinning that prevents the GC from retracting the end of the segment, then we can only form free spaces, and free spaces are always committed memory. Of course you might ask, ‘why don’t you just decommit the middle of a segment that has large free spaces?’ But that requires bookkeeping to remember which parts in the middle of a segment are decommitted so we need to re-commit them when we want to use them to allocate objects. And now we are getting into the idea of regions anyway, which is to have much smaller amounts of memory being manipulated separately by the GC.”
“[Imagine] a scenario where we have free spaces in one generation, say gen0 because there’s some async IO going on that caused us to demote a bunch of pins in gen0, that we don’t actually use (this could be due to not waiting for so long to do the next GC or we’d have accumulated too much survival which means the GC pause would be too long).” Segments are large units of memory – on Server GC 64-bit the segment sizes are 1GB, 2GB or 4GB each on SOH (for Workstation it’s much smaller – 256MB). Regions are much smaller units; they are by default 4MB each. So you might ask, “so they are smaller, why is that significant?”

Most of the improvements we’ve talked about thus far are cross-cutting. Sometimes, though, improvements are much more focused, with a change intended to improve the code generated for a very specific pattern. Here’s a benchmark I created by copying some of the sample code from the System.Math docs, editing it to create a method that computes the height of a trapezoid. The time taken in .NET 6 has dropped to nothing, and the code size has dropped from 179 bytes to 12. Why? Because the entire operation became a single constant. Further, for Ready To Run, dotnet/runtime#42831 from @nathan-moore ensured that the length of an array created from a constant could be propagated as a constant.

[MethodImpl(MethodImplOptions.NoInlining)] private static void UsedWhenCoolNotEnabled() [Benchmark] public void CallCorrectMethod() ; . Equals(Int32, Int32) mov rcx,1503FF62D58 mov rcx,[rcx] test rcx,rcx jne short M00_L00 mov rcx,1503FF62D58 mov rcx,[rcx] M00_L00: mov r11,7FFE420C03A0 mov rax,[7FFE424403A0] jmp rax ; Total bytes of code 51 ; .
Equals(Int32, Int32) mov rcx,1B4CE6C2F78 mov rcx,[rcx] mov r11,7FFE5AE60370 mov rax,[7FFE5B1C0370] jmp rax ; Total bytes of code 33

These methods can now participate in constant folding, so if their inputs end up as constants for whatever reason, the results can become constants as well, and with constant propagation, this leads to the potential for serious reduction in run-time evaluation. GetHeight() sub rsp,38 vzeroupper vmovsd xmm0,qword ptr [7FFE66C31CA0] vmovsd xmm1,qword ptr [7FFE66C31CB0] call System.Math.Pow(Double, Double) vmovsd qword ptr [rsp 28],xmm0 vmovsd xmm0,qword ptr [7FFE66C31CC0] vmovsd xmm1,qword ptr [7FFE66C31CD0] call System.Math.Pow(Double, Double) vmovsd xmm2,qword ptr [rsp 28] vsubsd xmm3,xmm2,xmm0 vmovsd qword ptr [rsp 30],xmm3 vmovsd xmm0,qword ptr [7FFE66C31CE0] vmovsd xmm1,qword ptr [7FFE66C31CF0] call System.Math.Pow(Double, Double) vaddsd xmm2,xmm0,qword ptr [rsp 30] vmovsd qword ptr [rsp 30],xmm2 vmovsd xmm0,qword ptr [7FFE66C31D00] vmovsd xmm1,qword ptr [7FFE66C31D10] call System.Math.Pow(Double, Double) vaddsd xmm1,xmm0,qword ptr [rsp 30] vsubsd xmm1,xmm1,qword ptr [7FFE66C31D20] vdivsd xmm0,xmm1,[7FFE66C31D30] vmovsd xmm1,qword ptr [7FFE66C31D40] call System.Math.Pow(Double, Double) vmovsd xmm2,qword ptr [rsp 28] vsubsd xmm0,xmm2,xmm0 vsqrtsd xmm0,xmm0,xmm0 add rsp,38 ret ; Total bytes of code 179, versus the .NET 6 version, which is just returning a constant double value.

There were additional folding-related improvements. dotnet/runtime#48568 from @SingleAccretion improved the handling of unsigned comparisons as part of constant folding and propagation; dotnet/runtime#47133 from @SingleAccretion changed in what phase of the JIT certain folding is performed in order to improve its impact on inlining; and dotnet/runtime#43567 improved the folding of commutative operators.
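To make the folding-plus-propagation idea concrete, here is a small example of my own (not from the post): once all inputs are compile-time constants, the whole chain can collapse to a single constant return.

```csharp
using System;

// Illustrative example (not the post's benchmark): with constant inputs,
// the compiler/JIT can fold the entire computation into one constant.
static double TrapezoidArea()
{
    const double a = 1.0;   // parallel side 1
    const double b = 3.0;   // parallel side 2
    const double h = 2.0;   // height

    // Folding: (1.0 + 3.0) / 2.0 * 2.0 is evaluated at compile time...
    double area = (a + b) / 2.0 * h;

    // ...and propagation lets that constant flow into further folding,
    // so the method can compile down to simply returning 8.0.
    return area * 2.0;
}

Console.WriteLine(TrapezoidArea());
```

The same principle is what turns the post’s trapezoid-height benchmark, a chain of `Math.Pow` calls over constants, into a 12-byte method that returns one constant double.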
Const() mov rax,12AE3A09B48 mov rax,[rax] test rax,rax jne short M00_L00 xor ecx,ecx xor r8d,r8d jmp short M00_L01 M00_L00: cmp [rax],eax cmp [rax],eax add rax,0C mov rcx,rax mov r8d,0B M00_L01: mov [rdx],rcx mov [rdx 8],r8d mov rax,rdx ret ; Total bytes of code 53 ; . Const() mov rax,18030C4A038 mov rax,[rax] add rax,0C mov [rdx],rax mov dword ptr [rdx 8],0B mov rax,rdx ret ; Total bytes of code 31 ; . (Int32) sub rsp,28 mov [rsp 30],rcx movzx eax,byte ptr [rsp 30] test eax,eax je short M01_L00 test eax,eax je short M01_L01 cmp [rsp 34],edx setg al movzx eax,al add rsp,28 ret M01_L00: xor eax,eax add rsp,28 ret M01_L01: call System.ThrowInvalidOperationException_InvalidOperation_NoValue() int 3 ; Total bytes of code 50

There’s been a lot of work in .NET 6 on the GC (garbage collector), the vast majority of which has been in the name of switching the GC implementation to be based on “regions” rather than on “segments”. The initial commit for regions is in dotnet/runtime#45172, with over 30 PRs since expanding on it. @maoni0 is shepherding this effort and has already written on the topic; I encourage reading her post Put a DPAD on that GC! But here are a few key statements from her post to help shed some light on the terminology: “So what are the key differences between segments and regions?”

For example, given this benchmark: private static readonly bool s_coolFeatureEnabled = GetCoolFeatureEnabled(); private static bool GetCoolFeatureEnabled() [MethodImpl(MethodImplOptions.NoInlining)] private static void UsedWhenCoolEnabled() [MethodImpl(MethodImplOptions. Consider that statics initialized in tier 0 can become consts in tier 1: when code reading that static field is recompiled in tier 1, the Boolean value can be considered a constant, enabling branches based on that value to be trimmed away. Div() mov ecx,[rcx 8] mov edx,24924925 mov eax,ecx mul edx sub ecx,edx shr ecx,1 lea eax,[rcx rdx] shr eax,2 ret ; Total bytes of code 23 ; .
Div() mov eax,[rcx 8] mov rdx,492492492493 mov eax,eax mul rdx mov eax,edx ret ; Total bytes of code 21 ; . IsEven(Int32) mov eax,edx shr eax,1F add eax,edx and eax,0FFFFFFFE sub edx,eax sete al movzx eax,al ret ; Total bytes of code 19 ; . IsEven(Int32) test dl,1 sete al movzx eax,al ret ; Total bytes of code 10 ; . Sum() sub rsp,28 mov rax,15434127338 mov rax,[rax] cmp dword ptr [rax 8],4 jl short M00_L00 mov rdx,rax mov ecx,[rdx 8] cmp ecx,0 jbe short M00_L01 mov edx,[rdx 10] mov r8,rax cmp ecx,1 jbe short M00_L01 add edx,[r8 14] mov r8,rax cmp ecx,2 jbe short M00_L01 add edx,[r8 18] cmp ecx,3 jbe short M00_L01 add edx,[rax 1C] mov eax,edx add rsp,28 ret M00_L00: xor eax,eax add rsp,28 ret M00_L01: call CORINFO_HELP_RNGCHKFAIL int 3 ; Total bytes of code 89 ; . Sum() mov rax,28B98007338 mov rax,[rax] mov edx,[rax 8] cmp edx,4 jl short M00_L00 mov rdx,rax mov edx,[rdx 10] mov rcx,rax add edx,[rcx 14] mov rcx,rax add edx,[rcx 18] add edx,[rax 1C] mov eax,edx ret M00_L00: xor eax,eax ret ; Total bytes of code 48 ; . AreZero(Int32, Int32) test ecx,ecx jne short M01_L00 test edx,edx sete al movzx eax,al ret M01_L00: xor eax,eax ret ; Total bytes of code 16 ; . AreZero(Int32, Int32) or edx,ecx sete al movzx eax,al ret ; Total bytes of code 9

But the JIT needs to be taught what kinds of things can be folded, and there’s been a lot of work happening here. dotnet/runtime#49930 teaches it how to fold null checks when used with constant strings, which, as in the previous example, is most valuable with inlining. For example, you might find code that dynamically enables or disables a feature based on an environment variable and then stores the result of that into a static readonly field.

At this point, I’ve spent a lot of blog real estate writing a love letter to the improvements made to the JIT in .NET 6. There’s still a lot more, but rather than share long sections about the rest, I’ll make a few final shout outs here.
Ok, we’ve talked about inlining optimizations, bounds checking optimizations, and loop optimizations. “Constant folding” is simply a fancy term to mean a compiler computing values at compile-time rather than leaving it to run-time. Constant folding also goes hand-in-hand with “constant propagation,” which is the practice of the compiler substituting a constant value into an expression, at which point compilers will often be able to iterate, apply more constant folding, do more constant propagation, and so on. Folding can happen at various levels of compilation.

One change in .NET 6 that makes extra-special use of this is dotnet/runtime#57217. Let’s say I have this non-trivial set of helper methods: . The methods being changed in this PR are expected to be called from C# compiler-generated code with literals, and being able to specialize based on the length of the string literal passed effectively enables multiple implementations of the method the JIT can choose from based on its knowledge of the literal used at the call site, resulting in faster and smaller code when such a literal is used. Of course, you may be asking yourself, “Does code really call such methods with literals?” And the answer is, absolutely, in lots of situations; the PR in .NET 5 that introduced the ability to treat it as a constant highlighted thousands of bytes of improvements in the generated assembly code across the core libraries.

Both dotnet/runtime#50982 and dotnet/runtime#52347 improved the JIT’s support for loop inversion. Now imagine that the counter is incremented, and we check the condition. The condition fails, and we’re already at the end of the loop, so we don’t jump back to the beginning and instead just keep running past the loop. The inverted structure also often affords additional optimizations; for example, the JIT’s pattern recognition used for loop cloning and the hoisting of invariants depend on the loop being in an inverted form.
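The loop-inversion transform described above can be written out by hand in source form. The JIT performs this on the generated code, not the C#; spelling it out just illustrates the shape (both methods are my own illustrative examples):

```csharp
using System;

// Original shape: the condition is tested at the top of every iteration,
// requiring a jump back to the top each time around.
static int SumWhile(int[] values)
{
    int sum = 0, i = 0;
    while (i < values.Length)
    {
        sum += values[i];
        i++;
    }
    return sum;
}

// Inverted shape: a one-time up-front check compensates for moving the
// condition test to the bottom of the loop, eliminating one branch per
// iteration on the common path.
static int SumInverted(int[] values)
{
    int sum = 0, i = 0;
    if (i < values.Length)
    {
        do
        {
            sum += values[i];
            i++;
        }
        while (i < values.Length);
    }
    return sum;
}

Console.WriteLine(SumInverted(new[] { 1, 2, 3 }));
```

Both versions compute the same result; the inverted form is also the shape the JIT’s loop-cloning and invariant-hoisting pattern recognition expects.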
Since we’re on the topic of loop optimization, consider loop inversion. “Loop inversion” is a standard compiler transform that’s aimed at eliminating some branching from a loop. Consider a loop like: In other words, change the while into a do..while, moving the condition check from the beginning of each iteration to the end of each iteration, and then add a one-time condition check at the beginning to compensate.

For example, consider a function like the following, which you might find in a collection implementation: the interface here doesn’t matter: I just selected an arbitrary value type that implements an interface. .NET Core enabled the JIT to avoid the boxing operation here (from casting a value type to an interface it implements) and emit this purely as a constrained method call, and then a subsequent release enabled it to be devirtualized and inlined: ; GetLength() push rsi sub rsp,30 vzeroupper vxorps xmm0,xmm0,xmm0 vmovdqu xmmword ptr [rsp 20],xmm0 mov dword ptr [rsp 20],5 mov dword ptr [rsp 24],6 mov dword ptr [rsp 28],7 mov rcx,offset MT_System.

That can include something as simple as the type of the array itself: change the previous example to use ; . Sum() sub rsp,28 mov rax,[rcx 8] xor edx,edx xor ecx,ecx mov r8d,[rax 8] M00_L00: cmp ecx,r8d jae short M00_L01 movsxd r9,ecx movzx r9d,byte ptr [rax r9 10] add edx,r9d inc ecx cmp ecx,0F423F jl short M00_L00 add rsp,28 ret M00_L01: call CORINFO_HELP_RNGCHKFAIL int 3 ; Total bytes of code 54 ; .
Sum() sub rsp,28 mov rax,[rcx 8] xor edx,edx xor ecx,ecx test rax,rax je short M00_L01 cmp dword ptr [rax 8],0F423F jl short M00_L01 nop word ptr [rax rax] M00_L00: movsxd r8,ecx movzx r8d,byte ptr [rax r8 10] add edx,r8d inc ecx cmp ecx,0F423F jl short M00_L00 jmp short M00_L02 M00_L01: cmp ecx,[rax 8] jae short M00_L03 movsxd r8,ecx movzx r8d,byte ptr [rax r8 10] add r8d,edx mov edx,r8d inc ecx cmp ecx,0F423F jl short M00_L01 M00_L02: add rsp,28 ret M00_L03: call CORINFO_HELP_RNGCHKFAIL int 3 ; Total bytes of code 97

Not just bytes, but the same issue manifests for arrays of non-primitive structs. Loop cloning enables the JIT to generate code that’s more like the equivalent of this: We end up paying for the extra up-front one-time checks, but as long as there’s at least a couple of iterations, the elimination of the bounds check pays for that and more. However, as with other bounds checking removal optimizations, the JIT is looking for very specific patterns, and things that deviate and fall off the golden path lose out on the optimization. Additionally, dotnet/runtime#55299 improved loop cloning for various loops over multidimensional arrays.

Inlining also goes hand-in-hand with devirtualization, which is the act in which the JIT takes a virtual or interface method call, determines statically the actual end target of the invocation, and emits a direct call to that target, saving on the cost of the virtual dispatch. Once devirtualized, the target may also be inlined (subject to all of the same rules and heuristics), in which case it can avoid not only the virtual dispatch overhead, but also potentially benefit from the further optimizations inlining can enable.
Another bounds-checking optimization that’s arguably a category of its own is “loop cloning.” The idea behind loop cloning is the JIT can duplicate a loop, creating one variant that’s the original and one variant that removes bounds checking, and then at run-time decide which to use based on an additional up-front check. However, doing such a bounds check on each iteration of the loop adds an extra comparison and branch on each iteration. In other cases, it’s not about whether there’s a bounds check, but what code is emitted for a bounds check that isn’t elided. For example, dotnet/runtime#42295 special-cases indexing into an array with a constant 0 index (which is actually fairly common) and emits a instruction, which makes the code both slightly smaller and slightly faster. This PR fixes that, the effect of which is evident in this benchmark, System.Span~1: DateTime) sub rsp,28 mov rax,[rcx] mov ecx,[rcx 8] test ecx,ecx jbe short M01_L00 cmp ecx,0 jbe short M01_L01 mov rcx,0FFFFFFFFFFFF and rdx,rcx mov [rax],rdx M01_L00: add rsp,28 ret M01_L01: call CORINFO_HELP_RNGCHKFAIL int 3 ; Total bytes of code 46 ; .

Net net, inlining is hugely powerful, but also something to be employed carefully, and the JIT methodically (but necessarily quickly) weighs decisions it makes about what to inline and what not to with a variety of heuristics. In this light, dotnet/runtime#50675, dotnet/runtime#51124, dotnet/runtime#52708, dotnet/runtime#53670, and dotnet/runtime#55478 improved the JIT by helping it to understand (and more efficiently understand) what methods were being invoked by the callee; by teaching the inliner about new things to look for, e.g. whether the callee could benefit from folding if handed constants; and by teaching the inliner how to inline various constructs it previously considered off-limits. Let’s take just one example from a comment on one of those PRs: .
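The loop-cloning shape described above can be sketched in source form. This is a hand-written illustration of what the JIT effectively produces in the generated code, not anything you would write yourself; the method and its parameters are my own example:

```csharp
using System;

// One up-front check selects between a fast path the JIT has proven safe
// (so no per-iteration bounds checks) and a slow path that keeps them.
static int SumFirst(int[] array, int count)
{
    int sum = 0;
    if (array is not null && count <= array.Length)
    {
        // "Cloned" fast path: i is provably in [0, array.Length), so the
        // per-iteration bounds check can be elided.
        for (int i = 0; i < count; i++) sum += array[i];
    }
    else
    {
        // Original slow path: bounds checks intact; indexing out of range
        // here throws IndexOutOfRangeException as usual.
        for (int i = 0; i < count; i++) sum += array[i];
    }
    return sum;
}

Console.WriteLine(SumFirst(new[] { 1, 2, 3 }, 3));
```

As the post notes, the one-time up-front checks cost a little, but with more than a couple of iterations the eliminated per-iteration bounds checks more than pay for them.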
But inlining also has other potentially deleterious effects. Computers use very fast and limited-size instruction caches to store code to be run. If you have 1000 bytes of code that you invoke from 100 different places, each of those places can potentially reuse the bytes previously loaded into the cache. But give each of those places their own (likely mutated) copy, and as far as the hardware is concerned, that’s different code, meaning the inlining can result in code actually running slower due to forcing more evictions and loads from and to that cache. There’s also the impact on the JIT compiler itself, as the JIT has limits on things like the size of a method before it’ll give up on optimizing further; inline too much code, and you can exceed said limits.

Another bounds-checking improvement comes in dotnet/runtime#49271 from @SingleAccretion. In previous releases, there was an issue in the JIT where an inlined method call could cause subsequent bounds checks that otherwise would have been removed to now no longer be removed. In .NET 5, this benchmark would result in assembly like the following: ) sub rsp,28 mov rax,[rcx] mov edx,[rcx 8] cmp edx,4 jl short M01_L00 cmp edx,0 jbe short M01_L01 mov word ptr [rax],74 cmp edx,1 jbe short M01_L01 mov word ptr [rax 2],72 cmp edx,2 jbe short M01_L01 mov word ptr [rax 4],75 cmp edx,3 jbe short M01_L01 mov word ptr [rax 6],65 mov eax,1 add rsp,28 ret M01_L00: xor eax,eax add rsp,28 ret M01_L01: call CORINFO_HELP_RNGCHKFAIL int 3 .NET 6 makes that a whole lot better, thanks to the aforementioned PRs. Much better. Those changes then also allowed undoing some hacks (e.g. dotnet/runtime#49450 from @SingleAccretion) in the core libraries that had previously been done to work around the lack of the bounds checking removal in such cases.

CoreLib]] call CORINFO_HELP_NEWSFAST mov rsi,rax vmovdqu xmm0,xmmword ptr [rsp 20] vmovdqu xmmword ptr [rsi 8],xmm0 mov rcx,rsi call Program. Object) mov rcx,rsi add rsp,30 pop rsi jmp near ptr System.
The major benefit is it exposes the contents of the callee to the context of the caller, enabling subsequent (“knock-on”) optimizations that wouldn’t have been possible without the inlining. If you inline too much, though, you bloat the code in your methods, potentially very significantly. All of the code associated with the callee will now end up carrying a (potentially optimized with knock-on effects) copy of the code; call it from 100 different locations, and you now have 100,000 bytes of assembly code rather than 1,000 that are reused. That means more memory consumption for the assembly code, and if it was AOT-compiled, more size on disk. That can make microbenchmarks look very good in some circumstances, but it can also have some bad net effects.

As such, the JIT goes to great lengths to avoid emitting bounds checking when it can prove going out of bounds isn’t possible, and the core .NET libraries do rely on avoiding this kind of overhead wherever possible. Each release of .NET sees the JIT become wise to more and more patterns where it can safely eliminate bounds checking, and dotnet/runtime#40180 and dotnet/runtime#43568 from @nathan-moore are great (and very helpful) examples. The JIT can see that, by construction, the loop can’t walk off either end of the array, and thus it needn’t emit a bounds check: Sum(Int32[]) xor eax,eax xor ecx,ecx mov r8d,[rdx 8] test r8d,r8d jle short M02_L01 M02_L00: movsxd r9,ecx add eax,[rdx r9*4 10] inc ecx cmp r8d,ecx jg short M02_L00 M02_L01: ret ; Total bytes of code 29. Note the lack of any range-check instruction at the end of the method; that’s because no call to a throw helper is required here, as there’s no bounds checking needed.

Consider the following benchmark: This represents relatively typical code you might see in some lower-level formatting, where the length of a span is checked and then data written into the span. In the past, the JIT has been a little finicky about which guard patterns here are recognized and which aren’t.
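The canonical pattern the JIT recognizes for bounds-check elimination is worth showing explicitly (an illustrative example of my own, not the post’s benchmark):

```csharp
using System;

// Iterating i from 0 up to arr.Length lets the JIT prove every index is in
// bounds, so it emits no per-iteration bounds check for arr[i].
static int SumNoBoundsChecks(int[] arr)
{
    int sum = 0;
    for (int i = 0; i < arr.Length; i++)
        sum += arr[i];
    return sum;
}

// Deviate from this "golden path" (e.g. cache the length in a field, iterate
// a range the JIT can't relate to arr.Length) and the check can come back.
Console.WriteLine(SumNoBoundsChecks(new[] { 1, 2, 3 }));
```

This is the source-level shape behind the 29-byte `Sum(Int32[])` listing above, where no `CORINFO_HELP_RNGCHKFAIL` throw helper appears at the end of the method.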
Simply ensure that you have installed each runtime you want to measure: .NET 6 RC1, along with the latest released downloads of . Compression; #if NETCOREAPP3_0_OR_GREATER using System.

There are other categories of optimization critical to high-performance C# and .NET code. Since it’s so foundational to good performance in .NET, let’s start by talking about inlining and devirtualization. “Inlining” is the process by which the compiler takes the code from a method callee and emits it directly into the caller. This avoids the overhead of the method call, but that’s typically only a minor benefit. For most code, the overhead is negligible, but if you’re reading this post, there’s a good chance you’ve written code where it’s not. And you certainly rely on code where it’s not: a lot of lower-level routines in the core .NET libraries. Of course, you also need to be careful with inlining. In other words, we didn’t just save the method call, we also transformed the entire operation into a constant. Here are just a few: That’s probably enough for now about inlining. The list will continue to grow over time as the JIT learns more and more how to, well, learn.

GetLength() push rsi sub rsp,30 vzeroupper vxorps xmm0,xmm0,xmm0 vmovupd [rsp 20],xmm0 mov dword ptr [rsp 20],5 mov dword ptr [rsp 24],6 mov dword ptr [rsp 28],7 mov rcx,offset MT_System. ITuple.get_Length() ; Total bytes of code 92; Program.

That’s because all accesses to arrays, strings, and spans are automatically “bounds checked” by the JIT, meaning it ensures before indexing into one of these data structures that the index is properly within bounds. If the index is equal to or greater than the array length, it jumps to the end of the method, which calls a helper that throws an exception. That comparison is the “bounds check.” Of course, such bounds checks add overhead.
Serialization; #endif [DisassemblyDiagnoser(maxDepth: 1)] // change to 0 for just the [Benchmark] method [MemoryDiagnoser(displayGenColumns: false)] public class Program ) to highlight cases where there’s a continuous improvement release-over-release. In some cases results are shown only for .NET 6.0, such as when highlighting the difference between an existing API and a new one in this release. Most of the results in the post were generated by running on Windows, primarily so that .NET Framework 4.8 could be included in the result set. However, unless otherwise called out, all of these benchmarks show comparable improvements when run on Linux or on macOS. Final note and standard disclaimer: microbenchmarking can be very subject to the machine on which a test is run, what else is going on with that machine at the same time, and sometimes seemingly the way the wind is blowing.

Code generation is the foundation on top of which everything else is built. As such, improvements to code generation have a multiplicative effect, with the power to improve the performance of all code that runs on the platform. .NET 6 sees an unbelievable number of performance improvements finding their way into the JIT (just-in-time compiler), which is used to translate IL (intermediate language) into assembly code at run-time, and which is also used for AOT (ahead-of-time compilation) as part of Crossgen2 and the R2R format (Ready To Run). One of the great things about .NET is that, unless you go out of your way to circumvent the protections put in place, it’s near impossible to experience typical security vulnerabilities like buffer overruns.

It can optimize for type casts, emitting checks that do a direct type comparison against the actual object type rather than always relying on more complicated and expensive cast helpers (e.g. ones that need to search ancestor hierarchies or interface lists or that can handle generic co- and contra-variance).
The JIT optimizes for PGO data in a variety of ways. Given the data it knows about how the code behaves, it can be more aggressive about inlining, as it has more data about what will and won’t be profitable. It can perform “guarded devirtualization” for most interface and virtual dispatch, emitting both one or more fast paths that are devirtualized and possibly inlined, with a fallback that performs the standard dispatch should the actual type not match the expected type. It can actually reduce code size in various circumstances by choosing to not apply optimizations that might otherwise increase code size (e.g. inlining, loop cloning, etc.) in blocks discovered to be cold.

One optimization common in PGO implementations is “hot/cold splitting”, where sections of a method frequently executed (“hot”) are moved close together at the beginning of the method, and sections of a method infrequently executed (“cold”) are moved to the end of the method. That enables better use of instruction caches and minimizes loads necessary to bring in likely-unused code. (The framework assemblies are compiled with PGO, used to execute a stable of representative apps and services, and then the resulting data is used to generate the final code that’s part of the shipped assemblies.) Enough setup…

So, this interface dispatch has moved to the end of the method, as based on PGO data the JIT expects it to be cold / rarely invoked. Yet this is the entirety of the original implementation; if that’s cold, what’s hot? Now at the beginning of the method, we see: CoreLib]] call CORINFO_HELP_NEWSFAST mov rcx,rax lea rsi,[rcx 8] vmovupd xmm0,[rsp 20] vmovupd [rsi],xmm0 call Program. Object) cmp [rsi],esi mov eax,3 add rsp,30 pop rsi ret ; Total bytes of code 92 jmp near ptr System. If it isn’t the expected type, then it jumps to the cold section we previously highlighted that’s performing the normal interface dispatch. But if it is, which, based on the profiling data, is expected to be the case the vast majority of the time, it can then proceed to directly invoke the target.
MoveNext() sub rsp,28 mov rcx,[rcx 8] mov r11,offset MT_System.EnumerableRangeIterator cmp [rcx],r11 jne short M00_L03 mov r11d,[rcx 0C] cmp r11d,1 je short M00_L00 cmp r11d,2 jne short M00_L01 mov r11d,[rcx 10] inc r11d mov [rcx 10],r11d cmp r11d,[rcx 18] je short M00_L01 jmp short M00_L02 M00_L00: mov r11d,[rcx 14] mov [rcx 10],r11d mov dword ptr [rcx 0C],2 jmp short M00_L02 M00_L01: mov dword ptr [rcx 0C],0FFFFFFFF M00_L02: add rsp,28 ret M00_L03: mov r11,7FFF8BB50378 call qword ptr [7FFF8BEB0378] jmp short M00_L02 — the sequence for doing the interface dispatch still exists here, but it’s at the end of the method.

(Note that the Ready To Run code for the framework assemblies does include implementations optimized based on PGO, just “static PGO”.) An accessed static can become a constant, as its value will have already been computed by the time the tier 1 code is compiled (dotnet/runtime#45901 improves the aforementioned queueing, using a dedicated thread rather than using the thread pool). Ready To Run images can participate in tiering, but they don’t go through a tier 0, rather they go straight from the Ready To Run code to tier 1; that means there’s no opportunity for dynamic PGO to instrument the binary for dynamically gathering insights. To enable instrumenting the core libraries as well, you can disable Ready To Run, which enables tiering for methods that contain loops: otherwise, anything that has a backward jump goes straight to tier 1, meaning it gets optimized immediately as if tiered compilation didn’t exist, but in doing so loses out on the benefits of first going through tier 0. You may hear talk about .NET referring to “full PGO”: that’s the case of all three of these environment variables being set, as then everything in the app is utilizing “dynamic PGO”.
Of course, there are many situations in which it's impossible for the JIT to statically determine the exact target of a method call, thus preventing devirtualization and inlining… However, with tiered compilation, a whole new world is available. Code starts in "tier 0," during which the JIT applies very few optimizations, which also means the JIT compiles code very quickly (optimizations are often what end up taking the most time during compilation). The emitted tier 0 code includes some tracking data to count how frequently methods are invoked, and once methods pass a certain threshold, the JIT queues them to be recompiled at "tier 1," this time with all the optimizations the JIT can muster, learning from the previous compilation; e.g. a static readonly field accessed in tier 0 can become a constant in tier 1, as its value will have already been computed by the time the tier 1 code is compiled. With "dynamic PGO," the JIT can now do further instrumentation during tier 0, to track not just call counts but all of the interesting data it can use for profile-guided optimization, and then it can employ that during the compilation of tier 1. To enable it, you need to set an environment variable, which turns on gathering all of the interesting data during tier 0; on top of that, there are some other environment variables you'll also want to consider setting, because the core libraries that ship with .NET are installed with Ready To Run images, which means they've essentially already been compiled into assembly code. A multitude of other changes have impacted devirtualization as well: for example, dotnet/runtime#53567 improves devirtualization in AOT Ready To Run images, and dotnet/runtime#45526 improves devirtualization with generics, such that information about the exact class obtained is then made available to improve inlining.
PGO as a concept isn't new; it's been implemented in a variety of development stacks, and has existed in .NET in some form before. But .NET 6 is something special when compared to previous releases; in particular, from my perspective, because of "dynamic PGO". The general idea behind profile-guided optimization is that a developer can first compile their app using special tooling that instruments the binary to track various pieces of interesting data. They can then run their instrumented application through typical use, and the resulting data from the instrumentation can be fed back into the compiler the next time around to influence how the compiler compiles the code. Traditionally, you'd build your app, run the data gathering process, and then rebuild the app feeding in the resulting data, and typically this would all be automated as part of a build pipeline; that process is referred to as "static PGO". For JIT'd code, tiered compilation meanwhile represents a compromise between getting going quickly and running with highly-optimized code.

Turning to threading: sometimes sync-over-async is unavoidable, e.g. when you're forced to implement an interface that's synchronous and the only means at your disposal to do so is functionality exposed only as an async method. We can see the impact of this with a terrible repro: it queues a bunch of work items to the thread pool, all of which block waiting for a task to complete, but that task won't complete until the final queued work item completes it to unblock all the other workers. Thus, we end up blocking every thread in the pool, waiting for the thread pool to detect the starvation and inject another thread, which the repro then dutifully blocks, and on and on, until finally there are enough threads that every queued work item can be running concurrently. On .NET 5, the above repro on my 12-logical-core machine takes ~32 seconds to complete. You can see the output below; pay attention to the timestamps on each work item, where you can see that after ramping up very quickly to a number of threads equal to the number of cores, the pool then very slowly introduces additional threads.
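The article's exact repro isn't reproduced in this excerpt, but a sketch with the same shape (the work-item count and output format here are my own assumptions) looks like this:

```csharp
using System;
using System.Diagnostics;
using System.Threading;
using System.Threading.Tasks;

int workItems = Environment.ProcessorCount * 4; // enough to exceed the pool's starting thread count
var tcs = new TaskCompletionSource();
var countdown = new CountdownEvent(workItems);
var sw = Stopwatch.StartNew();

for (int i = 0; i < workItems - 1; i++)
{
    ThreadPool.QueueUserWorkItem(id =>
    {
        Console.WriteLine($"{sw.Elapsed.TotalSeconds:0.00}: {id}");
        tcs.Task.Wait();     // sync-over-async: blocks a pool thread on an incomplete task
        countdown.Signal();
    }, i, preferLocal: false);
}

// Only the final work item completes the task everyone else is blocked on,
// so it can't run until the pool injects enough threads.
ThreadPool.QueueUserWorkItem(_ =>
{
    tcs.SetResult();
    countdown.Signal();
}, 0, preferLocal: false);

countdown.Wait();
Console.WriteLine($"Done: {sw.Elapsed}");
```

On a starved pool, the `Done` time is dominated by how quickly the thread pool decides to inject additional threads, which is exactly what the .NET 6 change improves.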
```
.51: 4   .51: 8   .51: 1   .51: 5   .51: 9   .51: 0   .51: 10  .51: 2
.51: 11  .51: 3   .51: 6   .51: 7   .52: 12  .52: 13  .53: 14  .52: 15
.52: 16  .02: 17  .02: 18  .52: 19  .51: 20  .52: 21  .52: 22  .03: 23
.02: 24  .03: 25  .01: 26  .03: 27  .02: 28  .02: 29  .52: 30  .52: 31
.52: 32  .02: 33  .02: 34  .53: 35  .51: 36  .02: 37  .02: 38  .52: 39
.52: 40  .52: 41  .52: 42  .55: 43  .52: 44  .53: 45  .52: 46  .02: 47
Done: .5128769
```

(The leading portions of the timestamps were lost in extraction; only the sub-second fractions remain.) I'm happy to say the situation improves here for .NET 6. This is not license to start writing more sync-over-async code, but rather a recognition that sometimes it's unavoidable, especially in existing applications that may not be able to move to an asynchronous model all at once, that might have some legacy components, etc. dotnet/runtime#53471 teaches the thread pool about the most common form of blocking we see in these situations: waiting on a Task that hasn't yet completed. In response, the thread pool becomes much more aggressive about increasing its target thread count while the blocking persists, and then immediately lowers the target count again as soon as the blocking has ended. On .NET 6, that ~32 seconds drops to ~1.5 seconds, with the pool injecting threads much faster in response to the blocking:

```
.39: 5   .39: 7   .39: 6   .39: 8   .39: 9   .39: 10  .39: 1   .39: 0
.39: 4   .39: 2   .39: 3   .47: 12  .47: 11  .47: 13  .47: 14  .47: 15
.47: 22  .47: 16  .47: 17  .47: 18  .47: 19  .47: 21  .47: 20  .50: 23
.53: 24  .56: 25  .59: 26  .63: 27  .66: 28  .69: 29  .72: 30  .75: 31
.78: 32  .81: 33  .84: 34  .91: 35  .97: 36  .03: 37  .10: 38  .16: 39
.22: 40  .28: 41  .35: 42  .41: 43  .47: 44  .54: 45  .60: 46  .68: 47
Done: .3649530
```

Interestingly, this improvement was made easier by another large thread-pool-related change. In previous versions of .NET, the thread pool's core dispatch routine was in managed code, but all of the logic around thread management was still native code in the runtime. All of that logic had previously been ported to C# in support of CoreRT and mono, but it wasn't used for coreclr.
With .NET 6 and dotnet/runtime#43841, that C# implementation is now used everywhere. This should make further improvements and optimizations easier and enable more advancements in the pool in future releases. Moving on from the thread pool, dotnet/runtime#55295 is an interesting improvement. One of the things you find a lot in multithreaded code, whether direct usage in low-lock algorithms or indirect usage in concurrency primitives like locks and semaphores, is spinning. Spinning is based on the idea that blocking in the operating system waiting for something to happen is very efficient for longer waits but incurs non-trivial overheads at the start and end of the waiting operation; if the thing you're waiting for will likely happen very, very soon, you might be better off just looping around to try again immediately or after a very short pause. My use of the word "pause" there is not coincidental, as the x86 instruction set includes the PAUSE instruction, which tells the processor the code is doing a spin-wait and helps it to optimize accordingly. However, the delay incurred by the PAUSE instruction can vary greatly across processor architectures, e.g. it might take only 9 cycles on an Intel Core i5, but 65 cycles on an AMD Ryzen 7, or 140 cycles on an Intel Core i7. That makes it challenging to tune the behavior of higher-level code written using spin loops, which core code in the runtime and key concurrency-related types in the core libraries do. To address this discrepancy and provide a consistent view of pauses, previous releases of .NET tried to measure the duration of pauses at startup, and then used those metrics to normalize how many pauses are issued when one is needed. While the measurement wasn't being done on the main thread of the startup path, it was still contributing milliseconds of CPU time to every process, a number that can add up over the millions or billions of .NET processes run around the world.
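The pause normalization itself is internal to the runtime, but the way application code benefits from it is through types like `SpinWait`, which mix these normalized pauses with eventual yields. A minimal sketch:

```csharp
using System;
using System.Threading;

int value = 0;

// A writer flips the value shortly after start.
var writer = new Thread(() =>
{
    Thread.Sleep(10);
    Volatile.Write(ref value, 42);
});
writer.Start();

// The reader spin-waits. SpinWait.SpinOnce issues processor pauses first
// (normalized by the runtime as described above) and falls back to yielding
// the thread if the wait drags on, so the loop stays cheap either way.
var spinner = new SpinWait();
while (Volatile.Read(ref value) == 0)
{
    spinner.SpinOnce();
}

writer.Join();
Console.WriteLine(value); // 42
```

The point of the runtime change is that `SpinOnce` now behaves more consistently across CPUs and across a process's lifetime, without each process paying a startup measurement cost.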
It also was only done once per process, but for a variety of reasons that cost could actually change during a process' lifetime, for example if a VM was suspended and moved from one physical machine to another. To address this, the aforementioned PR changes the scheme: rather than measuring once at startup for a longer period of time, it periodically does a short measurement and uses that to refresh its perspective on how long pauses take. This should lead to an overall decrease in CPU usage as well as a more up-to-date understanding of what these pauses cost, leading to more consistent behavior of the apps and services that rely on them.

Let's move on to async methods and the builders the C# compiler uses for them. Synchronously-completing async methods are, counterintuitively, extremely common; consider a method where the first invocation does I/O to fill a buffer, but subsequent operations simply consume from that buffer. For such synchronous completions, rather than constructing a new task object, the builder can hand back a cached one; such caching isn't possible when the result type can be cached but the specific value is not, but on the balance it's worthwhile given the savings for extremely common values. Most developers should never see these builder types in any code they read or write. Of course, tasks are very closely tied to async methods in C#, and it's worth looking at a small but significant feature in C# 10 and .NET 6. One of the downsides to the builder model, however, is that which builder is selected is tied to the definition of the type being returned from the async method. So, if you define your async method to return a given task-like type, you have no way to control the builder that's employed: it's determined by that type and only by that type. There are a variety of reasons someone might want to control the details of the lifecycle of the returned object, but one of the most prominent is pooling.
The builder attribute is what allows enterprising developers to plug in a custom implementation, including one that could potentially pool. What if, instead of using the aforementioned default builders, we could author a builder that used and pooled custom reusable objects? The first time an async method needed to yield and move all its state from the stack to the heap, such a builder would consult the pool and try to use an object already there, only allocating a new one if one wasn't available in the pool; then, upon completion, the object would be returned to the pool. An earlier experiment along these lines is complete and its environment variable has been removed for .NET 6; in its stead, this capability is supported in a new form. The experiment surfaced problems, however. First, code that misused the returned objects happened to work when pooling was disabled, but failed in one of multiple ways and with varying levels of severity when pooling was enabled. Code analysis rule CA2012 is meant to help avoid such code, but that alone is insufficient to prevent such breaks. Second, as you can see from the benchmark above, while the pooling avoided the allocation, it came with a bit more overhead. And not shown here is the additional overhead in memory and working set of having to maintain the pool at all, which is maintained per async method. There are also some potential overheads not shown here, things that are common pitfalls to any kind of pooling. For example, the GC is optimized to make gen0 collections really fast, and one of the ways it can do that is by not having to scan gen1 or gen2 as part of a gen0 GC. But if there are references to gen0 objects from gen1 or gen2, then it does need to scan portions of those generations (this is why storing references into fields involves "GC write barriers," to see if a reference to a gen0 object is being stored into one from a higher generation).
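The new form this capability takes in .NET 6 / C# 10 is applying the builder attribute per method. A minimal sketch using the pooled builder that ships in .NET 6 (`PoolingAsyncValueTaskMethodBuilder<>`); the method name and the `Task.Yield` are my own illustration:

```csharp
using System;
using System.Runtime.CompilerServices;
using System.Threading.Tasks;

// C# 10 allows [AsyncMethodBuilder] on an individual method, overriding the
// builder normally implied by the ValueTask<T> return type. Only this method
// opts into pooling; callers still just see a ValueTask<int>.
[AsyncMethodBuilder(typeof(PoolingAsyncValueTaskMethodBuilder<>))]
static async ValueTask<int> ReadValueAsync()
{
    await Task.Yield(); // force an asynchronous completion so the state machine hits the builder/pool path
    return 42;
}

int result = await ReadValueAsync();
Console.WriteLine(result); // 42
```

As the surrounding text stresses, this is an opt-in tradeoff: it's worth using only where profiling shows the saved state-machine allocations actually matter.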
Since the entire purpose of pooling is to keep objects around for a long time, those objects will likely end up being in these higher generations, and any references they store could end up making GCs more expensive; that can easily be the case with these state machines, as every parameter and local used in the method could potentially need to be tracked as such. So, from a performance perspective, it's best to use this capability only in places where it's both likely to matter and where performance testing demonstrates it moves the needle in the right direction. We can see, of course, that there are scenarios where, in addition to saving on allocation, it actually does improve throughput, which at the end of the day is typically what one is really focusing on when measuring allocation reduction (i.e. reducing allocation to reduce time spent in garbage collection).

A new API in .NET 6, Task.WaitAsync, is useful in replacing a fairly common pattern that shows up (and that, unfortunately, developers often get wrong) when wanting to wait for a task to complete with either or both a timeout and cancellation. For example (benchmark bodies elided in this excerpt):

```csharp
private SemaphoreSlim _sem = new SemaphoreSlim(0, 1);
private CancellationTokenSource _cts = new CancellationTokenSource();

[Benchmark] public Task WithCancellationToken() { /* body elided */ }
[Benchmark] public Task WithTimeout() { /* body elided */ }
[Benchmark] public Task WithCancellationTokenAndTimeout() { /* body elided */ }
```

Related to tasks, the cancellation support in .NET 6 has also seen performance improvements, both for existing functionality and for new APIs that enable an app to do even better. One interesting improvement is dotnet/runtime#48251, which is a good example of how one can design, implement, and optimize for one scenario only to find that it's making the wrong tradeoffs. The original design optimized for heavy concurrent registration, which led to a really neat (but complicated) lock-free implementation that involved quite a bit of allocation and overhead.
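The pattern the elided benchmark bodies exercise can be sketched directly with the .NET 6 `Task.WaitAsync` overloads (the semaphore setup mirrors the snippet above; the five-second timeout is my own choice):

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

var sem = new SemaphoreSlim(0, 1);
using var cts = new CancellationTokenSource();

Task wait = sem.WaitAsync(); // the operation we want to bound

// Each overload returns a task that completes when the original task does,
// or faults with cancellation/timeout, without hand-rolled WhenAny+Delay.
Task withToken   = wait.WaitAsync(cts.Token);
Task withTimeout = wait.WaitAsync(TimeSpan.FromSeconds(5));
Task withBoth    = wait.WaitAsync(TimeSpan.FromSeconds(5), cts.Token);

sem.Release(); // let the underlying wait complete normally
await Task.WhenAll(withToken, withTimeout, withBoth);
Console.WriteLine(withBoth.IsCompletedSuccessfully); // True
```

The hand-rolled alternative (`Task.WhenAny` with a `Task.Delay`) is easy to get wrong, e.g. by leaving the delay's timer running after the wait completes; `WaitAsync` handles that cleanup internally.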
If you were in fact registering and unregistering on the same token from lots of threads in parallel, the implementation was very efficient and resulted in good throughput. But if you weren't, you were paying a lot of overhead for something that wasn't providing reciprocal benefit. And, as luck would have it, that's almost never the scenario these days. It's much, much more common to have a token that's used serially, often with multiple registrations all in place at the same time, but with those registrations mostly having been added as part of the serial flow of execution rather than all concurrently. This PR recognizes this reality and reverts the implementation to a much simpler, lighter-weight, and faster one that performs better for the vast majority of use cases (while taking a hit if it is actually hammered by multiple threads concurrently).

```csharp
private CancellationTokenSource _source = new CancellationTokenSource();

[Benchmark] public void CreateTokenDispose() { /* body elided */ }
[Benchmark] public void CreateRegisterDispose() { /* body elided */ }
[Benchmark] public void CreateLinkedTokenDispose() { /* body elided */ }
[Benchmark(OperationsPerInvoke = 1_000_000)] public void CreateManyRegisterDispose() { /* body elided */ }
[Benchmark(OperationsPerInvoke = 1_000_000)] public void CreateManyRegisterMultipleDispose() { /* body elided */ }
```

New registration overloads also pass the token responsible for the callback being invoked into the callback itself, enabling code that was instantiating a new delegate and potentially a closure in order to get access to that information to instead use a cached delegate instance (as the compiler generates for lambdas that don't close over any state). And dotnet/runtime#50346 makes it easier to reuse CancellationTokenSource instances. Many sources are never canceled, and if they're not canceled, there's nothing that prevents them from continuing to be used, potentially stored into a pool for someone else to use in the future.
This gets a bit tricky, however, if CancelAfter is used or if the constructor taking a timeout is used, as both of those cause a timer to be created, and there are race conditions possible between the timer firing and someone checking whether the source can be reused. That's what the new TryReset method is for: if it returns true, the source hasn't had cancellation requested and any underlying timer has been reset as well, such that it won't fire without a new timeout being set. If it returns false, well, don't try to reuse it, as no guarantees are made about its state. You can see how the Kestrel web server does this, via dotnet/aspnetcore#31528 and dotnet/aspnetcore#34075.

Those are some of the bigger performance-focused changes in threading; there are a myriad of smaller ones as well. For example, the JIT now generates better code for some Interlocked operations, emitting the atomic instruction directly rather than calling out to a helper:

```asm
; before: CompareExchange()
       sub      rsp,28
       cmp      [rcx],ecx
       add      rcx,8
       mov      edx,1
       xor      r8d,r8d
       call     00007FFEC051F8B0
       inc      rax
       add      rsp,28
       ret
; Total bytes of code 31

; after: CompareExchange()
       cmp      [rcx],ecx
       add      rcx,8
       mov      edx,1
       xor      eax,eax
       lock cmpxchg [rcx],rdx
       inc      rax
       ret
; Total bytes of code 22
```

Just one small example of how the platform gets better every release.

Guid is used to provide unique identifiers for any number of things and operations. The ability to create them quickly is important, as is the ability to quickly format and parse them. Previous releases have seen significant improvements on all these fronts, but they get even better in .NET 6. Take a simple benchmark for parsing: one change avoids culture-aware code paths during parsing, as culture isn't necessary or desired when parsing hexadecimal digits; another removes a call needed only because a shared code path had to handle negative-number rendering as well, and when you're counting nanoseconds at this low level of the stack, such calls can make a measurable difference.

Random, meanwhile, had employed the same algorithm for the last two decades, a variant of Knuth's subtractive random number generator algorithm that dates back to the 1980s. In the intervening years, a myriad of pseudo-random algorithms have emerged.
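The TryReset-based reuse described above can be sketched with a minimal (hypothetical) pool; Kestrel's actual code is more involved, but the shape is the same:

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading;

var pool = new ConcurrentQueue<CancellationTokenSource>();

CancellationTokenSource Rent() =>
    pool.TryDequeue(out var cts) ? cts : new CancellationTokenSource();

void Return(CancellationTokenSource cts)
{
    // TryReset (.NET 6) returns true only if cancellation was never requested
    // and any pending timer has been disarmed; otherwise the source's state
    // is unknowable, so we throw it away rather than risk reusing it.
    if (cts.TryReset())
        pool.Enqueue(cts);
    else
        cts.Dispose();
}

var first = Rent();
first.CancelAfter(TimeSpan.FromMinutes(10)); // arms a timer for this "request"
Return(first);                               // timer disarmed; instance back in the pool

var second = Rent();
Console.WriteLine(ReferenceEquals(first, second)); // True: the instance was reused
```

If `Cancel()` had been called on `first`, `TryReset` would have returned false and the source would have been disposed instead of pooled.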
For .NET 6, in dotnet/runtime#47085, we picked a new, modern algorithm for unseeded Random instances. The challenge is history: Random was introduced with almost all of its methods virtual. In addition to that leading to virtual dispatch overheads, it has additional impact on the evolution of the type: because someone could have overridden one of the methods, any new method we introduce has to be written in terms of the existing virtuals… so, for example, when the span-based methods were added, a derived type that overrode all the methods (and, say, locked around each one) couldn't have its behavior respected by the members unavailable at the time the derived type was created. In .NET 6, we check at construction time whether we're dealing with a derived type, and fall back to the old implementation if so, otherwise preferring an implementation that needn't be concerned about such compatibility issues. Similarly, over the years we've been hesitant to change the behavior of Random's seeded constructor (supplying a specific seed is common, for example, in tests); now in .NET 6, just as for derived types, we fall back to the old implementation if a seed is supplied, otherwise preferring the new algorithm. This sets us up for a future where we can more freely change and evolve the unseeded algorithm.

Guid parsing also got a rewrite: dotnet/runtime#55792 and dotnet/runtime#56210 rewrote parsing for the 'D', 'B', 'P', and 'N' formats (all but the antiquated 'X') to be much more streamlined, with careful attention paid to avoidance of bounds checking, how data is moved around, the number of instructions to be executed, and so on. The net result is a very nice increase in throughput; I love seeing tables like this: gains going to .NET Core 3.1, another ~1.3x on top of that going to .NET 5, and further gains again in .NET 6.

These Random changes also highlight tradeoffs made in optimizations. The approach of dynamically choosing the implementation to use when the instance is constructed means we incur an extra virtual dispatch on each operation; for the seeded case, we don't have algorithmic wins to offset that cost.
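The seeded-vs-unseeded split described above is observable directly: seeded instances keep producing the same legacy sequences, while unseeded use gets the new algorithm (and .NET 6 also adds `Random.Shared` as a thread-safe instance that avoids constructing a `Random` at every call site):

```csharp
using System;

// Seeded: falls back to the compatible legacy algorithm, so the same seed
// keeps producing the same sequence across releases.
var seeded1 = new Random(42);
var seeded2 = new Random(42);
Console.WriteLine(seeded1.Next(100) == seeded2.Next(100)); // True

// Unseeded: gets the new, faster algorithm chosen at construction time.
var unseeded = new Random();
int sample = unseeded.Next();

// Random.Shared (.NET 6): a thread-safe instance for casual use.
int roll = Random.Shared.Next(1, 7);
Console.WriteLine(roll >= 1 && roll <= 6); // True
```

The construction-time choice is exactly the "extra virtual dispatch per operation" tradeoff the text describes: one indirection buys the freedom to evolve the unseeded algorithm.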
As the overhead is small (on my machine 1-2ns) and as the scenarios for providing a seed are a minority use case in situations where counting nanoseconds matters (passing a specific seed is often used in testing, for example, where repeatable results are required), we accepted the tradeoff. But even the smallest, planned regressions can nag at you, especially when discussing them very publicly in a blog post, so in dotnet/runtime#57530 we mitigated most of them (basically everything other than the simplest seeded overload, which on my machine is ~4% slower than in .NET 5) and even managed to turn most into improvements. This was done primarily by splitting the compat strategy implementation further into seeded and derived-type variants, which enables the former to avoid any virtual dispatch between members (while for the latter, derived types can still override to provide their own implementation). As previously noted, a method like `Next(int, int)` delegates to another virtual method on the instance, but that virtual delegation can now be removed entirely for the seeded case as well.

Environment-related operations got faster, too. In addition to changes in the implementation, retrieving all environment variables now uses `IndexOf` to search for the separators between key/value pairs, rather than an open-coded loop. In addition to reducing the amount of code needed in the implementation, this takes advantage of the fact that `IndexOf` is heavily optimized with a vectorized implementation. The net result is much faster retrieval of all environment variables: on my machine, with the environment variables I have in my environment, the results show a substantial improvement. New members were also added to provide not only simpler access to commonly-accessed information, but also much faster access; for a fundamental operation that's used on hot paths, the extra validation required to ensure an offset and count are in-bounds adds up. It's pretty common for apps, for example in logging code, to want to get the current process' ID.
Before these additions, code would often do something like spin up a Process object just to read its ID. Environment has since seen both improvements to the performance of its existing methods (so that existing usage just gets faster) and new members added to it (such that minor tweaks to how it's being consumed in an app can yield further fruit); dotnet/runtime#51548 and dotnet/runtime#53388 add new members along these lines.

dotnet/runtime#44355 is a small PR with a sizeable impact, improving the performance of some generic helper methods. In .NET 5, they utilized the same internal helpers as their non-generic counterparts and ended up boxing the strongly-typed input as an implementation detail. This PR fixes that by adding a strongly-typed internal helper, and it tweaks which existing methods these generic methods delegate to so as to use ones that can operate faster given the strongly-typed nature of the generic methods. (Relatedly, storing an element into an array of most reference types also incurs measurable overhead, due to covariance checks.)

But the biggest impact in this area came from negating the regressions incurred when leap seconds support was added (in .NET Core 3.0, and .NET Framework 4.8 at the same time), which (knowingly) regressed the performance of getting the current time by around 2.5x when the Windows leap seconds feature is enabled. "Leap seconds" are rare, one-second adjustments made to UTC that stem from the fact that the Earth's rotation speed can and does actually vary over time. In .NET 6, dotnet/runtime#50263 provides a scheme for still maintaining the leap seconds support while avoiding the impactful overhead, getting back to the same throughput as without the feature. Other changes helped, too, by avoiding some duplicative validation, ensuring fast paths are appropriately inlined (and slow paths aren't), and other such tweaks.

When spans were introduced, it was recognized that if you create a writeable span, you're very likely going to write to it, and thus if the cost of a check needs to be paid somewhere, it's better to pay that cost once when the span is created rather than on every write into the span.
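The process-ID example above is concrete in .NET 5+: `Environment.ProcessId` caches the ID and avoids materializing a `Process` object at every call site:

```csharp
using System;
using System.Diagnostics;

// Older pattern: allocates a Process object (and more) just to read one int.
int idOld = Process.GetCurrentProcess().Id;

// Newer pattern (.NET 5+): a cached static property, no allocation.
int idNew = Environment.ProcessId;

Console.WriteLine(idOld == idNew); // True

// .NET 6 similarly adds Environment.ProcessPath for the executable's path.
Console.WriteLine(Environment.ProcessPath);
```

In hot logging paths the difference is not just the allocation but everything `GetCurrentProcess` drags along with it, which is why a dedicated property is worth having.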
For many apps and services, creating and manipulating arrays, strings, and spans represents a significant portion of their processing, and a lot of effort goes into finding ways to continually drive down the costs of these operations. Several operations are now vectorized: a fast path checks whether vectorization is applicable, and if the check doesn't pass, it falls back to a typical loop, writing the value into each element of the array. As long as we're on the topic of vectorization, other support in this release has been vectorized as well; dotnet/runtime#44111 takes advantage of SSSE3 hardware intrinsics. That PR is another good example of "optimizations" frequently being tradeoffs, making some scenarios faster at the expense of making others slower, and needing to make a decision based on the expected frequency of these scenarios occurring. In this case, the PR improves three specific cases significantly, when the character being searched for isn't super frequent in the input, enabling the vectorization to pay for itself and more. If, however, it's very frequent (like every or every other character in the input), this change can actually regress performance. Our bet, based on reviewing use cases in a variety of code bases, is that overall this will be a very positive win.

Moving to span-based rather than pointer-based code often not only makes the code simpler and safer but also faster. Part of one change involved vectorizing the "skip over all ASCII chars which don't require encoding" logic that all of the text encoders utilize, helping to yield some significant speedups in common scenarios. Elsewhere, concatenating strings from an enumerable source iterates, appending to a builder as it goes; but there's already a separate array-based code path that does two passes over the strings, one to count the size required and then another to fill in the resulting string of the required length, and dotnet/runtime#44032 converts more of the functionality to be based on that two-pass approach. One of the biggest string-related improvements, though, comes from the new interpolated string handler support in C# 10 and .NET 6.
That support landed with new language work in dotnet/roslyn#54692 and library work in dotnet/runtime#51086 and dotnet/runtime#51653. Now, when I write an interpolated string, all of the parsing is handled at compile-time, with no additional array allocation and no additional boxing allocations. You can see the impact of the changes with the aforementioned examples turned into a benchmark; for the full details, see the String Interpolation in C# 10 and .NET 6 blog post.

Performance improvements can manifest in many ways: increasing throughput, reducing working set, reducing latencies, increasing startup speeds, lowering size on disk, and so on. Readers of these posts will also notice a focus on reducing allocation. This is typically a means to an end rather than a goal in and of itself, as managed allocations are easily trackable / measurable and incur varying costs, in particular the secondary cost of causing GCs to happen more frequently and/or take longer. Sometimes reducing allocations falls into the category of just stopping doing unnecessary work, or doing something that's way cheaper instead; for example, dotnet/runtime#42776 changed an eight-byte array allocation to an eight-byte stack allocation, the latter of which is very close to free (in particular as this code is compiled in a way that doesn't need to pay to zero out that stack-allocated space). Beyond that, though, there are almost always real tradeoffs. One common technique is pooling, which can look great in microbenchmarks because it drives down the allocation number, but it doesn't always translate into a measurable improvement in one of the other metrics that's actually an end goal.
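What "all of the parsing handled at compile-time" means is visible in a tiny example: in C# 10 targeting .NET 6, the compiler lowers this interpolation to `DefaultInterpolatedStringHandler` append calls rather than a `string.Format` call with a boxed argument array:

```csharp
using System;

int id = 42;
string name = "alpha";

// Lowered by the C# 10 compiler to DefaultInterpolatedStringHandler:
// the literal fragments become AppendLiteral calls resolved at compile time,
// and each hole becomes a strongly-typed AppendFormatted call, so the int
// is not boxed and no params object[] is allocated.
string message = $"id={id}, name={name}";

Console.WriteLine(message); // id=42, name=alpha
```

The same mechanism is what lets APIs accept custom handler types, so that, e.g., a logging call can skip building the string entirely when the log level is disabled.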
In fact, it can make things worse, such as if the overhead of renting and returning from the pool is higher than expected (especially if it incurs synchronization costs), if it leads to cache problems as something returned on one NUMA node ends up being consumed from another, if it leads to GCs taking longer by increasing the number of references from gen1 or gen2 objects to gen0 objects, and so on. However, one place that pooling has shown to be quite effective is with arrays, in particular larger arrays of value types, and the shared array pool sees several impactful improvements in this release toward being as efficient as possible. One notable change in .NET 6 is the for-all-intents-and-purposes removal of the upper limit on the size of arrays the pool will cache. Previously, renting above that limit meant a fresh allocation every time. Ouch. That is a large cliff to fall off of, and either the developer is aware of the cliff and is forced to adapt to it in their code, or they're not aware of it and end up having unexpected performance problems. This somewhat arbitrary limit was originally put in place before the pool had "trimming," a mechanism that enables the pool to drop cached arrays in response to gen2 GCs, with varying levels of aggressiveness based on perceived memory pressure; but then that trimming was added, and the limit was never revisited… dotnet/runtime#55621 raises the limit as high as the current implementation's scheme enables, which means it can now cache arrays up to approximately one billion elements. Of course, pooling such arrays means it's important that trimming works as expected, and while there's an unending amount of tuning we could do to the trimming heuristics, the main gap that stood out had to do with how arrays in the pool are stored. With today's implementation, the pool is divided into buckets with sizes equal to powers of two, so for example there's a bucket for arrays with a length up to 16, then up to 32, then up to 64, and so on: requesting an array of size 100 will garner you an array of size 128.
Arrays are stored in two layers. The first layer is thread-local storage, where each thread can store at most one array of each bucket size. The second layer is itself split into stacks, each of which is logically associated with one core, and each of which is individually synchronized. Code renting an array first consults the thread-local storage slot, and if it's unable to get an array from there, proceeds to examine each of the stacks, starting with the one associated with the core it's currently running on (which can of course change at any moment, so the affinity is quite soft and accesses require synchronization). Upon returning an array, a similar path is followed, with the code first trying to return to the thread-local slot, and then proceeding to try to find space in one of the stacks. Trimming in .NET 5 and earlier is able to remove arrays from the stacks, and is given the opportunity on every gen2 GC, but it will only ever drop arrays from the thread-local storage if there's very high memory pressure. This can lead to some rarely-used arrays sticking around for a very long time, negatively impacting working set. dotnet/runtime#56316 addresses this by tracking approximately how long arrays have been sitting in thread-local storage, and enabling them to be culled regardless of high memory pressure, instead using memory pressure to indicate what's an acceptable duration for an array to remain. On top of these changes around what can be cached and for how long, more typical performance optimizations have also been done: dotnet/runtime#55710 and dotnet/runtime#55959 reduced typical overheads for renting and returning arrays. This entailed paying attention to where and why bounds checks were happening and avoiding them where possible, ordering checks performed to prioritize common cases (e.g. a request for a size of 0), and reducing code size to make better use of instruction caches.
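The rent/return flow described above is what sits behind the public `ArrayPool<T>.Shared` API; a minimal usage sketch (the 100-element request size illustrates the power-of-two bucket rounding mentioned earlier):

```csharp
using System;
using System.Buffers;

// Requests are rounded up to the containing power-of-two bucket, so asking
// for 100 elements yields an array of at least 128 from the shared pool.
byte[] buffer = ArrayPool<byte>.Shared.Rent(100);
Console.WriteLine(buffer.Length >= 100); // True

try
{
    // ... use buffer[0..100]; the pool does not clear arrays between uses ...
}
finally
{
    // Returning first tries the current thread's slot, then a per-core stack,
    // per the layering described above. Never touch the array after this.
    ArrayPool<byte>.Shared.Return(buffer);
}
```

Note that rented arrays may contain stale data from a previous renter; pass `clearArray: true` to `Return` when the contents are sensitive.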
FileStream also got a major overhaul. Previously, FileStream might have issued an overlapped I/O read operation, but typically that read would end up blocking in a sync-over-async manner, in order to avoid potential race conditions in the implementation that could otherwise result. Or when flushing its buffer, even when flushing asynchronously, those flushes would end up doing synchronous writes. Such issues often ended up defeating any scalability benefits of using asynchronous I/O while still incurring the overheads associated with asynchronous I/O (async I/O often has higher overheads in exchange for being more scalable). All of this was complicated further by the code being a tangled web difficult to unravel, in large part because it was trying to integrate a bunch of different capabilities into the same code paths: using overlapped I/O or not, buffering or not, targeting disk files or pipes, etc., with different logic for each, all intertwined. Combined, this has meant that, with a few exceptions, FileStream was entirely rewritten for .NET 6, and in the process, all of these issues were resolved. The result is a much more maintainable implementation that's also dramatically faster, in particular for asynchronous operations. There have been a plethora of PRs as part of this effort, but I'll call out a few. First, dotnet/runtime#47128 laid the groundwork for the new implementation, refactoring FileStream to be a wrapper around a "strategy" (as in the Strategy design pattern), which then enables the actual implementation to be substituted and composed at runtime. dotnet/runtime#48813 and dotnet/runtime#49750 then introduced the beginnings of the new implementation, splitting it apart into several strategies on Windows: one for if the file was opened for synchronous I/O, one for if it was opened for asynchronous I/O, and one that enables buffering to be layered on top of any strategy. dotnet/runtime#55191 then introduced a Unix-optimized strategy for the new scheme.
All the while, additional PRs were flowing in to optimize various conditions. dotnet/runtime#49975 and dotnet/runtime#56465 avoided an expensive syscall made on every operation on Windows to track the file's length, while dotnet/runtime#44097 removed an unnecessary seek when setting file length on Unix. dotnet/runtime#50802 and dotnet/runtime#51363 changed the overlapped I/O implementation on Windows to use a custom, reusable awaitable, which enabled making (non-buffered) async reads and writes amortized-allocation-free when using async I/O. dotnet/runtime#55206 from @tmds used knowledge from an existing syscall being made on Unix to then avoid a subsequent unnecessary one.

```csharp
private FileStream _fileStream;
private byte[] _buffer = new byte[1024];

[Params(false, true)] public bool IsAsync { get; set; }
[Params(1, 4096)] public int BufferSize { get; set; }

[GlobalSetup] public void Setup() { /* body elided */ }
[GlobalCleanup] public void Cleanup() { /* body elided */ }
[Benchmark] public void Read() { /* body elided */ }
[Benchmark] public async Task ReadAsync() { /* body elided */ }
```

Beyond FileStream itself, new static APIs provide overloads that enable sync and async reading and writing, for both a single buffer and multiple buffers at a time, while specifying the exact offset into the file at which the read or write should occur; all of these static methods accept a file handle. Separately, the Brotli implementation was upgraded from v1.0.7 to v1.0.9 in dotnet/runtime#44107 from @saucecontrol. That upgrade brings with it various performance improvements, including code paths that make better use of intrinsics; not all compression/decompression measurably benefits, but some certainly does.

The contract for asynchronous stream read operations is that, assuming you request at least one byte, the operation won't complete until at least one byte is read; however, the contract makes no guarantees whatsoever that the operation will return all that you requested. In fact, it's rare to find a stream that does make such a guarantee, and it's problematic in many cases when it does.
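The new offset-based file APIs mentioned above are `File.OpenHandle` plus the `RandomAccess` methods in .NET 6; a minimal sketch (the temp-file name is my own):

```csharp
using System;
using System.IO;
using System.Text;
using Microsoft.Win32.SafeHandles;

string path = Path.Combine(Path.GetTempPath(), $"ra-demo-{Environment.ProcessId}.bin");
string result;

using (SafeFileHandle handle = File.OpenHandle(path, FileMode.Create, FileAccess.ReadWrite))
{
    // Write at an explicit file offset: no Stream, no tracked position, so a
    // single handle can safely be used from multiple threads concurrently.
    RandomAccess.Write(handle, Encoding.ASCII.GetBytes("hello world"), fileOffset: 0);

    // Read 5 bytes starting at offset 6, again stating the offset explicitly.
    byte[] buffer = new byte[5];
    int read = RandomAccess.Read(handle, buffer, fileOffset: 6);
    result = Encoding.ASCII.GetString(buffer, 0, read);
}

File.Delete(path);
Console.WriteLine(result); // world
```

There are also `ReadAsync`/`WriteAsync` and multi-buffer (scatter/gather) overloads following the same handle-plus-offset shape.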
Unfortunately, as an implementation detail, DeflateStream was in fact trying to return as much data as was requested, issuing as many reads against the underlying stream as it needed to in order to make that happen, stopping only when it decoded a sufficient amount of data to satisfy the request or hit EOF (end of file) on the underlying stream. That behavior is problematic for two reasons. First, it prevents overlapping the processing of any data that may already have been received with the waiting for more data to arrive; if 100 bytes are already available but I asked for 200, I'm forced to wait to process the 100 until another 100 are received or the stream hits EOF. Second, and more impactful, it effectively breaks scenarios where the stream is used for bidirectional communication: if a peer sends a 100-byte message but the reader asks for more, DeflateStream's behavior here will deadlock the whole system, as it will sit on the received 100-byte message waiting for another 900 bytes or an EOF that will never arrive. dotnet/runtime#53644 fixes that by enabling DeflateStream (and a few other streams) to return once it has data to hand back, even if not the sum total requested. This has been documented as a breaking change, not because the previous behavior was guaranteed (it wasn't), but because we've seen enough code erroneously depend on the old behavior that it was important to call out. The PR also fixes one more performance-related thing. One issue scalable web servers need to be cognizant of is memory utilization. If you've got a thousand open connections, and you're waiting for data to arrive on each connection, you could perform an asynchronous read on each using a buffer, but if that buffer is, say, 4K, that's 4MB worth of buffers sitting there wasting working set. If you could instead issue a zero-byte read, where you perform an empty read simply to be notified when there is data that could be received, you can avoid any working set impact from buffers, only allocating or renting a buffer once you know there's data to put in it.
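The zero-byte read pattern can be sketched like this; note that not every Stream implements zero-byte-read semantics, so this sketch assumes a stream that does:

```csharp
using System;
using System.Buffers;
using System.IO;
using System.Threading.Tasks;

public static class ZeroByteReadExample
{
    // Wait for data with a zero-byte read, then rent a buffer only once
    // there's actually something to put in it, keeping idle connections
    // from each pinning a multi-KB buffer in the working set.
    public static async Task<int> ReadWhenAvailableAsync(Stream stream)
    {
        await stream.ReadAsync(Memory<byte>.Empty); // completes when data is available

        byte[] buffer = ArrayPool<byte>.Shared.Rent(4096);
        try
        {
            int bytesRead = await stream.ReadAsync(buffer); // the real read
            // (Real code would process buffer[0..bytesRead] before returning it.)
            return bytesRead;
        }
        finally
        {
            ArrayPool<byte>.Shared.Return(buffer);
        }
    }
}
```

With a thousand idle connections, this trades a thousand live buffers for a thousand pending zero-byte reads, which cost nothing in buffer space.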
Lots of stream implementations were also updated to use new shared argument-validation helpers, which, in addition to eliminating duplicated code and helping to ensure consistency of behavior, streamline the validation logic using a shared, efficient implementation of the relevant checks. It goes without saying that networking is at the heart of services and most significant apps today, and so improvements in networking performance are critical to the platform. At the bottom of the networking stack sit the Socket types. Several PRs removed dependencies on libraries like System.Security.Principal.dll; in addition to reducing the complexity of the code, they also reduced allocation. To see the impact, here's a little microbenchmark that repeatedly establishes a new loopback connection. New overloads were also added that accept spans and memory, which makes it cheaper to send only a portion of some array (the alternative with the existing overloads would be to allocate a new array of the desired length and copy the relevant data into it), and then dotnet/runtime#52208, also from @gfoidl, added more (subsequently, dotnet/runtime#53062 provided the cancellation support that had been stubbed out in the previous PR). On top of that, dotnet/runtime#55232, and then dotnet/runtime#56777 from @huoyaoyuan, reduced the overhead of these operations further. Speaking of async: locals in an async method are lifted by the compiler to live as fields on the generated state machine type for that method, and that state machine will end up being allocated on the heap if the async method completes asynchronously. The fewer and smaller the fields on these state machines, the smaller the resulting allocation will be for asynchronously completing async methods. But locals written by the developer aren't the only reason fields get added.
An earlier change had moved Dns's async APIs on Linux over to glibc's asynchronous name-resolution support, but that PR was subsequently reverted in dotnet/runtime#48666 upon realizing that the implementation was just queueing these calls to be executed synchronously on a limited-size thread pool internal to glibc, and that we could similarly employ an "async-over-sync" solution in managed code and with more control. Here "async-over-sync" refers to the idea of implementing an asynchronous operation by queueing a synchronous piece of work to be done on another thread, rather than employing truly asynchronous I/O all the way down to the hardware. This ends up blocking that other thread for the duration of the operation, which inherently limits scalability. It can also be a real bottleneck for something like DNS. Typically an operating system will cache some amount of DNS data, but in cases where a request is made for unavailable data, the OS has to reach out across the network to a DNS server to obtain it. If lots of requests are made concurrently for the same non-cached address, that can starve the pool with all of the operations performing the exact same request. To address this, dotnet/runtime#49171 implements that async-over-sync in a way that asynchronously serializes all requests for the same destination; that way, if bursts do show up, we only end up blocking one thread for all of them rather than one thread for each. This adds a small amount of overhead for individual operations, but significantly reduces the overhead in the bursty, problematic scenarios. In the future, we will hopefully be able to do away with this once we're able to implement a truly async I/O-based mechanism on Unix, potentially implemented directly on newer kernel facilities. There's also a new option for callers: Dns.GetHostAddresses can return both IPv4 and IPv6 addresses, but if you know you only care about one or the other, you can now be explicit about it by passing an AddressFamily. Doing so can save on both the amount of data transferred and the resulting allocations to hand back that data.
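The serialization idea described here can be sketched as coalescing concurrent lookups for the same key onto a single queued work item. This is an illustrative pattern (a simplified variant that coalesces identical in-flight requests), not .NET's internal Dns implementation:

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;

public sealed class CoalescingResolver<TKey, TResult>
{
    private readonly ConcurrentDictionary<TKey, Task<TResult>> _inflight = new();
    private readonly Func<TKey, TResult> _resolve; // the blocking, synchronous lookup

    public CoalescingResolver(Func<TKey, TResult> resolve) => _resolve = resolve;

    public Task<TResult> GetAsync(TKey key)
    {
        // All concurrent callers for the same key share one queued work item,
        // so a burst of identical lookups blocks one thread instead of many.
        return _inflight.GetOrAdd(key, k =>
        {
            Task<TResult> task = Task.Run(() => _resolve(k));
            task.ContinueWith(_ => _inflight.TryRemove(k, out _), TaskScheduler.Default);
            return task;
        });
    }
}
```

The removal in the continuation is what keeps the cache from serving stale results: once a lookup completes, the next request for that key starts a fresh one.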
A benchmark comparing the two shapes:

```csharp
private static readonly string s_hostname = Dns.GetHostName();

[Benchmark(OperationsPerInvoke = 1000, Baseline = true)]
public async Task GetHostAddresses() { /* resolve s_hostname for all address families */ }

[Benchmark(OperationsPerInvoke = 1000)]
public async Task GetHostAddresses_OneFamily() { /* resolve s_hostname for just one family */ }
```

Of course, not all URLs contain pure ASCII. Such cases often involve escaping characters using percent-encoding, and dotnet/runtime#32552 optimized those code paths by replacing a multi-pass scheme that involved temporary allocations with a more streamlined approach. As noted previously, I love seeing this continual march of progress, with the exact same API getting faster and faster every release, as more and more opportunities are discovered, new capabilities of the underlying platform are utilized, code generation improves, and so on. Now we get to HttpClient, which was improved in many ways from a performance perspective (and in many more from a functionality perspective, including preview support for HTTP/3, better standards adherence, distributed tracing integration, and more knobs for configuring how it should behave). Previous releases saw a lot of effort applied to driving down the overheads of the HTTP stack, but the public API for headers forced a particular set of work and allocations to be performed. That changes with the new HttpHeaders.NonValidated property (dotnet/runtime#53555), which returns a view over the headers collection that does not force parsing or validation when reading/enumerating. Even within the old constraints, we've driven down some costs, such as by no longer forcing headers added to the collection through work they don't need. A benchmark enumerating response headers against a loopback server:

```csharp
private static readonly Socket s_listener =
    new Socket(AddressFamily.InterNetwork, SocketType.Stream, ProtocolType.Tcp);
private static readonly HttpMessageInvoker s_client = new HttpMessageInvoker(new HttpClientHandler());
private static Uri s_uri;

[GlobalSetup]
public void CreateSocketServer() { /* start a loopback server and point s_uri at it */ }

[Benchmark(Baseline = true)]
public async Task Enumerate() { /* send a request and enumerate the response headers */ }
```

This has both a functional and a performance benefit. Functionally, it means headers sent by a server can be inspected in their original form, for consumers that really want to see the data prior to it having been sanitized/transformed by the client.
But from a performance perspective, it has a significant impact, as it means that (a) the validation logic we'd normally run on headers can be omitted entirely, and (b) any allocations that would result from that validation are also avoided. And indeed, even with all the I/O and HTTP protocol logic being performed, tweaking the API used for header enumeration results in an ~20% boost in throughput on this benchmark. Another area that saw significant improvement was HttpClient's connection pooling. One change here comes in dotnet/runtime#50545, which simplifies the code and helps on all platforms, but in particular improves a long-standing potential performance issue on Windows (our Unix implementation generally didn't suffer the same problem, because of differences in how asynchronous I/O is implemented). The handler maintains a pool of connections that remain open to the server and that it can use to service future requests. By default, it needs to scavenge this pool periodically, to close connections that have been around for too long or that, more relevant to this discussion, the server has chosen to close. To determine whether the server has closed a connection, we need to poll the underlying socket, but in some situations we don't actually have access to the underlying socket in order to perform the poll (and, with the advent of the ConnectCallback extensibility point, there may not be a Socket involved at all). In such situations, the only way we can be notified of a connection being closed is to perform a read on the connection. Thus, if we were unable to poll the socket directly, we would issue an asynchronous read (which would then be used as the first read as part of handling the next request on that connection), and the scavenging logic could check the task for that read to see whether it had completed erroneously.
On Windows, overlapped I/O read operations often involve pinning a buffer for the duration of the operation (on Unix, we implement asynchronous reads via epoll, and no buffer need be pinned for the duration); that meant if we ended up with a lot of connections in the pool, and we had to issue asynchronous reads for each, we'd likely end up pinning a whole bunch of sizeable buffers, leading to memory fragmentation and potentially sizeable working-set growth. The fix: rather than issuing the actual read using the connection's buffer, we instead issue a read using an empty buffer. All of the relevant streams support the notion of zero-byte reads, where rather than returning immediately, they instead wait to complete the asynchronous read until at least some data is available, even though they won't be returning any of that data as part of the operation. Then, only once that operation has completed, the actual initial read is issued, which is necessary both to actually get the first batch of response data and to handle arbitrary streams that may not fully support zero-byte-read semantics. Separately, SslStream was potentially itself holding onto a valuable buffer, and (on Windows) pinning it, even though that was unnecessary. dotnet/runtime#49123 addresses that by special-casing zero-byte reads to not use a buffer and to not force an internal buffer into existence if one isn't currently available, making better use of buffer space and reducing the frequency of underlying calls. dotnet/runtime#51320 from @benaadams helped avoid some unnecessary buffer growth after he noticed that some constants related to TLS frame sizes, which had been in the code for a long time, were no longer sufficient for newer TLS protocols, and dotnet/runtime#51324, also from @benaadams, helped avoid some casting overheads by being more explicit about the actual types being passed through the system. dotnet/runtime#53851 provides another very interesting improvement related to connection pooling.
Let's say all of the connections for a given server are currently busy handling requests, and another request comes along. Unless you've configured a maximum limit on the number of connections per server and hit that limit, the handler will happily create a new connection to service your request (in the case of HTTP/2, by default per the HTTP/2 specification there's only one connection, with a server-set limit on the number of requests/streams multiplexed onto that connection, though you can opt in to using more than one connection). The question then is: what happens to that request if, while waiting for the new connection to be established, one of the existing connections becomes available? Up until now, that request would just wait for and use the new connection. With the aforementioned PR, the request can now use whichever connection becomes available first, whether it be an existing one or a new one, and whatever connection isn't used will simply find its way back to the pool. This should improve both latency and response time, and potentially reduce the number of connections needed in the pool, thus saving memory and networking resources. .NET Core 3.0 introduced support for HTTP/2, and since then the use of the protocol has been growing. This has led us to discover where things worked well and where more work was needed. One area in particular that needed some love was flow control: the implementation used a fixed-size receive window (64KB), such that if the receive buffer wasn't large enough to keep the network busy, the system could stall. To address that, the receive buffer needs to be large enough to handle the "bandwidth-delay product" (a network connection's capacity multiplied by round-trip communication time). dotnet/runtime#54755 adds support for dynamically sizing the receive window, as well as several knobs for tweaking the behavior.
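The "whichever connection becomes available first" behavior amounts to racing two tasks and returning the loser to the pool. A minimal sketch of that idea (the names and pooling callback here are illustrative, not the handler's actual internals):

```csharp
using System;
using System.Threading.Tasks;

public static class ConnectionRaceExample
{
    // Race a wait-for-idle-connection task against a new-connection attempt;
    // whichever finishes first serves the request, and the other connection
    // is returned to the pool rather than wasted.
    public static async Task<TConn> GetConnectionAsync<TConn>(
        Task<TConn> waitForIdleConnection,
        Task<TConn> establishNewConnection,
        Action<TConn> returnToPool)
    {
        Task<TConn> winner = await Task.WhenAny(waitForIdleConnection, establishNewConnection);
        Task<TConn> loser = winner == waitForIdleConnection ? establishNewConnection : waitForIdleConnection;

        // When the slower task eventually completes, pool its connection.
        _ = loser.ContinueWith(
            t => { if (t.Status == TaskStatus.RanToCompletion) returnToPool(t.Result); },
            TaskScheduler.Default);

        return await winner; // propagates any exception from the winning task
    }
}
```

The key design point is that the new-connection attempt isn't cancelled when an existing connection wins the race; finishing it and pooling the result means the next burst of requests finds a warm connection waiting.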
This should significantly help with performance, in particular on networks with reasonably high bandwidth along with some meaningful delay in communications (e.g. with geographically distributed data centers), while also not consuming too much memory. There's also been a steady stream of small improvements to HttpClient, things that on their own don't account for much but added together help to move the needle. For example, dotnet/runtime#54209 from @teo-tsirpanis converted a small class to a struct, saving an allocation per connection; dotnet/runtime#50487 removed a closure allocation from header-value parsing in the common case where a value has a media type but no additional parameters; and dotnet/runtime#45303 optimized the loading of the Huffman static encoding table used by HTTP/2. The original code employed a single, long array of tuples, which required the C# compiler to generate a very large function for initializing each element of the array; the PR changed that to instead be two blittable arrays whose data can be stored directly in the binary. Moving up the stack, improvements to websockets benefit both client and server. There have been a handful of small improvements in this area, such as dotnet/runtime#49831, which saved a few hundred bytes of allocation as part of the websocket handshake by using span-based APIs to create the data for the headers used in the websocket protocol, and dotnet/runtime#52022 from @zlatanov, which saved a few hundred bytes from each websocket by shrinking state that was overkill for the target scenario. But there were two significant changes worth examining in more detail. The first is per-message compression, a long-requested feature whose implementation came in dotnet/runtime#49304 from @zlatanov. Adding compression increases the CPU cost of sending and receiving, but it decreases the amount of data sent and received, which can in turn decrease the overall cost of communication, especially as networking latency increases.
As such, the benefit of this one is harder to measure with BenchmarkDotNet, and I'll instead just use a console app that wraps the connection in an intermediary stream tracking the number of bytes read, which it then exposes for the app to print. That way, we can see how much data actually ends up being sent. The app is downloading the complete works of Mark Twain from Project Gutenberg, such that each sent message is ~15MB. When I run this, I get results like the following:

```
Compressed: False Bytes: 16,013,945 Time: 42ms
Compressed: False Bytes: 16,013,945 Time: 13ms
Compressed: False Bytes: 16,013,945 Time: 13ms
Compressed: False Bytes: 16,013,945 Time: 12ms
Compressed: False Bytes: 16,013,945 Time: 12ms
Compressed: True  Bytes:  6,326,310 Time: 580ms
Compressed: True  Bytes:  6,325,285 Time: 571ms
Compressed: True  Bytes:  6,325,246 Time: 569ms
Compressed: True  Bytes:  6,325,229 Time: 571ms
Compressed: True  Bytes:  6,325,168 Time: 571ms
```

So, we can see that on this very fast loopback connection, the cost of the operation is dominated by the compression; however, we're sending only a third as much data. That could be a good tradeoff if communicating over a real network with longer latencies, where the additional few hundred milliseconds to perform the compression and decompression is minimal compared to the cost of sending and receiving an additional 10MB. The second change is amortized zero-allocation websocket receiving. In .NET Core 2.1, overloads were added to the websocket receive APIs to make receiving allocation-free when the operation completed synchronously, which would happen if the data being received was already available. When the operation completed asynchronously, however, it would still allocate. Shifting gears: reflection provides a very powerful mechanism for inspecting metadata about .NET assemblies and invoking functionality in those assemblies. That mechanism can incur non-trivial expense, however. While functionality exists to avoid that overhead for repeated calls (e.g.
using MethodInfo.CreateDelegate to get a strongly-typed delegate directly to the target method), that's not always relevant or appropriate. As such, it's valuable to reduce the overhead associated with reflection itself, which .NET 6 does in several ways. A variety of PRs targeted reducing the overhead involved in inspecting attributes. dotnet/runtime#54402 significantly reduced the overhead of common attribute lookups by giving them a dedicated code path rather than relegating the core logic to an existing shared method (dotnet/runtime#45292, from @benaadams as well, also removed some low-level overhead from filtering attribute records), and dotnet/runtime#54405 eliminated an allocation from another attribute code path. Code often looks up information beyond attributes, and it can be helpful for performance to special-case common patterns. dotnet/runtime#44759 recognizes that reflection-based code will often look at method parameters, which many methods don't have, and avoids allocating a fresh array for every individual parameterless method inspected. Reflection is valuable not just for getting metadata but also for invoking members. If you ever do an allocation profile for code using reflection to invoke methods, you'll likely see a bunch of object[] allocations for the argument lists passed to the target method. The runtime needs to validate that the arguments are of the types the method expects; if they're not, it could end up violating type safety by passing a type A to a method that instead receives it as a completely unrelated type B, and then all use of that "B" is potentially invalid and corrupting. However, if a caller erroneously mutated the argument array concurrently with the reflection call, such mutation could happen after the type checks occurred, enabling type safety to be violated anyway. So, the runtime is forced to make a defensive copy of the argument array and then validate the copy, to which the caller doesn't have access.
dotnet/runtime#50814 addresses this by recognizing that most methods have at most only a few parameters, and special-cases methods with up to four parameters to instead use a stack-allocated span for the defensive copy. Finally, sometimes optimizations are all about deleting code and just calling something else that already exists: dotnet/runtime#42891 simply changed the implementation of one helper in the runtime to call another existing helper, making it faster in the process. Relatedly, DispatchProxy (Migrating RealProxy Usage to DispatchProxy provides a good description) utilizes reflection emit to generate IL at run-time, and dotnet/runtime#47134 optimizes both that process and the generated code in such a way that it saves several hundred bytes of allocation per method invocation on a proxy. Moving on to collections and LINQ: every .NET release has seen the core collection types and LINQ get faster and faster. Even as a lot of the low-hanging fruit was picked in previous releases, developers contributing to .NET 6 have still managed to find meaningful improvements, some in the form of optimizing existing APIs, and some in the form of new APIs developers can use to make their own code fly. One nice dictionary improvement in .NET 6 comes from @benaadams in dotnet/runtime#41944. The PR improves the performance of creating one dictionary from another, by enabling the common case of the source dictionary and the new dictionary sharing a key comparer to copy the underlying buckets without rehashing. dotnet/runtime#49388 from @benaadams and dotnet/runtime#54611 from @Sergio0694 are examples of new APIs that developers can use with dictionaries when they want to eke out that last mile of performance. These APIs are defined on the CollectionsMarshal class, returning a ref to the value if the key could be found, or adding an empty entry and returning a ref to it otherwise. They can be used to avoid duplicate lookups as well as potentially expensive struct value copies. Indexing into the tree-based immutable collections also gets faster, thanks to dotnet/runtime#53266 from @L2.
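For instance, a word-count histogram can avoid the classic double lookup (a TryGetValue followed by an indexer set) by using the ref-returning helper added in .NET 6; a minimal sketch:

```csharp
using System.Collections.Generic;
using System.Runtime.InteropServices;

public static class Histogram
{
    public static Dictionary<string, int> Count(IEnumerable<string> words)
    {
        var counts = new Dictionary<string, int>();
        foreach (string word in words)
        {
            // One hash/lookup instead of two: get a ref to the value slot,
            // adding a default(int) entry if the key wasn't already present.
            ref int count = ref CollectionsMarshal.GetValueRefOrAddDefault(counts, word, out _);
            count++;
        }
        return counts;
    }
}
```

Because the increment happens through the ref, there's no second lookup to write the updated value back, and for large struct values there's no copy of the value at all.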
Indexing into these collections performs a binary search through a tree of nodes, and each layer of the traversal was performing a range check on the index. But for all but the entry-point check, that range validation is duplicative and can be removed, which is exactly what the PR does. Another notable addition is the new PriorityQueue<TElement, TPriority>, which was initially added in dotnet/runtime#46009 by @pgolebiowski and then subsequently optimized further in PRs like dotnet/runtime#48315, dotnet/runtime#48324, dotnet/runtime#48346, and dotnet/runtime#50065. Also new is Enumerable.TryGetNonEnumeratedCount, which enables getting the count of the number of items in an enumerable if that count can be determined quickly. This can be useful to avoid the overhead of resizes when presizing a collection that will be used to store the contents of the enumerable. Lastly, it's somewhat rare today to see code written against the non-generic collection interfaces (e.g. IEnumerable rather than IEnumerable<T>), but such code does exist. We don't need to optimize heavily for such code, but sometimes the stars align and efforts to simplify such code actually make it significantly faster as well, as is the case with dotnet/runtime#51351, which simplified (and sped up) a non-generic implementation. On to cryptography: CryptoStream was originally written to forcibly queue encryption and decryption operations to the thread pool. However, typical usage is actually very fast and doesn't warrant a thread hop, and even if it weren't fast, guidance has evolved over the years such that the recommendation now wouldn't be to queue anyway. On top of that, an ICryptoTransform can transform one or multiple "blocks" of data at a time; the interface expects input to be processed in blocks of a particular fixed number of input bytes, which then yield a fixed number of output bytes. Often, though, you don't need streaming transforms at all, and instead just want something simple and fast to handle encrypting and decrypting data you already have in memory.
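As a quick illustration of the new priority queue type shipped in .NET 6:

```csharp
using System;
using System.Collections.Generic;

public static class PriorityQueueExample
{
    public static void Main()
    {
        // PriorityQueue<TElement, TPriority> dequeues the lowest priority first.
        var pq = new PriorityQueue<string, int>();
        pq.Enqueue("write blog post", 2);
        pq.Enqueue("fix the build", 1);
        pq.Enqueue("review PRs", 3);

        while (pq.TryDequeue(out string task, out int priority))
        {
            Console.WriteLine($"{priority}: {task}");
        }
        // Prints the tasks in priority order: 1, 2, 3.
    }
}
```

Unlike rolling your own with SortedDictionary or a hand-written heap, the built-in type is array-backed and avoids per-node allocations.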
For that, dotnet/runtime#52510, dotnet/runtime#55184, and dotnet/runtime#55480 introduced new "one-shot" methods. The instance-based APIs still expose the full spread of functionality, but for simple cases there's no actual need for an instance at all, as the underlying OS objects used on all platforms are now thread-safe and implemented in a scalable manner. In the same vein, dotnet/runtime#43221 added static Rfc2898DeriveBytes.Pbkdf2 methods that use the PBKDF2 (Password-Based Key Derivation Function 2) key-derivation function to generate the requested bytes without explicitly creating an instance; this, in turn, enables the implementation to use any "one-shot" APIs provided by the underlying operating system. Other cryptography improvements include dotnet/runtime#46819; using stack allocation in signature generation to avoid unnecessary allocation (dotnet/runtime#46893); using better OS APIs on macOS for RSA/ECC keys (dotnet/runtime#52759 from @filipnavara); and avoiding closures in the interop layers on both Unix (dotnet/runtime#50511) and Windows (dotnet/runtime#50376, dotnet/runtime#50377). One of my favorites, simply because it eliminates an annoyance I hit now and again, is dotnet/runtime#53129 from @hrrrrustic, which adds a generic interface implementation that had been missing. As has been shown in this post and in those I've written for previous versions, there have been literally thousands of PRs into .NET over the last several years to improve its performance. Many of these changes on their own have a profound and very measurable impact on some scenario. However, a fair number of the changes are what we lovingly refer to as "peanut butter": a thin layer of tiny performance-impacting changes that individually aren't hugely meaningful but that over time add up to bigger impact. Sometimes these changes make one specific change in one place (e.g. removing one allocation), and it's the aggregate of all such changes that helps. Sometimes it's a pattern of change applied en masse across the stack. One of my favorite sets of changes, and a pattern which will hopefully be codified in a future release by an analyzer, shows up in dotnet/runtime#49958, dotnet/runtime#50225, and dotnet/runtime#49969.
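A sketch of the one-shot pattern using the static PBKDF2 API added in .NET 6 (the iteration count and output length here are illustrative choices, not recommendations):

```csharp
using System.Security.Cryptography;
using System.Text;

public static class Pbkdf2Example
{
    public static byte[] DeriveKey(string password, byte[] salt) =>
        // One-shot: no Rfc2898DeriveBytes instance to create and dispose,
        // letting the implementation defer to the OS's own one-shot PBKDF2.
        Rfc2898DeriveBytes.Pbkdf2(
            Encoding.UTF8.GetBytes(password),
            salt,
            100_000,                   // iterations
            HashAlgorithmName.SHA256,
            32);                       // output length in bytes
}
```

Compare the older shape, which required constructing an instance, calling GetBytes, and disposing it, paying allocation and lifetime-management costs on every derivation.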
These PRs changed over 2300 internal and private classes across dotnet/runtime to be sealed. Will sealing every one of those types measurably improve performance? For some of the types it won't, but there are multiple reasons why sealing types can measurably improve performance, and so we've adopted a general policy that all non-public types that can be sealed should be, so as to maximize the chances that use of these types will simply be better than it otherwise would be. One reason sealing helps is that virtual methods on a sealed type are more likely to be devirtualized by the runtime. If the runtime can see that a given instance on which a virtual call is being made is of a sealed type, then it knows for certain what the actual target of the call will be, and it can invoke that target directly rather than doing a virtual dispatch operation. Better yet, once the call is devirtualized, it might be inlineable, and if it's inlined, all the previously discussed benefits around optimizing the caller and callee combined kick in. The difference shows up in the generated assembly:

```
; Program.NonSealed()
       sub       rsp,28
       mov       rcx,[rcx+10]
       mov       rax,[rcx]
       mov       rax,[rax+40]
       call      qword ptr [rax+20]
       add       eax,2A
       add       rsp,28
       ret
; Total bytes of code 26

; Program.Sealed()
       mov       rax,[rcx+8]
       cmp       [rax],eax
       mov       eax,2C
       ret
; Total bytes of code 12
```

The sealed version isn't calling anything: in fact, it's been reduced to a null check followed by returning a constant value, as the sealed type's method was devirtualized and inlined, at which point the JIT could constant-fold the entire computation. Note how BenchmarkDotNet reports such a result: in order to measure the cost of a benchmark, it not only times how long it takes to invoke the benchmark but also how long it takes to invoke an empty benchmark with a similar signature, with the results presented subtracting the latter from the former. BenchmarkDotNet is then highlighting that, with the method just returning a constant, the benchmark and the empty method are now indistinguishable.
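A minimal pair showing the shape of code that benefits; the type and method names are illustrative, chosen to match the constants in the assembly above (M() returns 2, the benchmarks add 42, and the sealed path folds to 44, i.e. 0x2C):

```csharp
public class BaseType
{
    public virtual int M() => 2;
}

public class NonSealedType : BaseType { }

public sealed class SealedType : BaseType { }

public class Program
{
    private NonSealedType _nonSealed = new NonSealedType();
    private SealedType _sealed = new SealedType();

    // The JIT can't be sure of _nonSealed's concrete type (a subclass could
    // override M), so this stays a virtual call through the method table.
    public int NonSealed() => _nonSealed.M() + 42;

    // _sealed's type is sealed, so the call can devirtualize, inline, and
    // constant-fold down to "null check + return 44".
    public int Sealed() => _sealed.M() + 42;
}
```

Marking the field's declared type sealed is what matters here: the guarantee travels with the type, so every call site benefits without any per-site analysis.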
Another benefit of sealing is that it can make type checks a lot faster. Compare the code generated for an `is` check against a non-sealed type versus a sealed one:

```
; Program.NonSealed()
       sub       rsp,28
       mov       rdx,[rcx+8]
       mov       rcx,offset MT_Program+NonSealedType
       call      CORINFO_HELP_ISINSTANCEOFCLASS
       test      rax,rax
       setne     al
       movzx     eax,al
       add       rsp,28
       ret
; Total bytes of code 37

; Program.Sealed()
       mov       rax,[rcx+8]
       test      rax,rax
       je        short M00_L00
       mov       rdx,offset MT_Program+SealedType
       cmp       [rax],rdx
       je        short M00_L00
       xor       eax,eax
M00_L00:
       test      rax,rax
       setne     al
       movzx     eax,al
       ret
; Total bytes of code 36
```

The non-sealed check has to call the CORINFO_HELP_ISINSTANCEOFCLASS runtime helper, while the sealed check is just a direct comparison against the type's method table, and note the resulting impact that has on the cost of running this code. Yet another benefit comes when using arrays of such types. Arrays in .NET are covariant, which means a derived-type array can be used where a base-type array is expected, and the runtime sometimes has to perform checks to preserve type safety as a result; when the element type is sealed, some of those checks can be elided. A benchmark demonstrating it:

```csharp
private SealedType _sealedInstance = new();
private SealedType[] _sealedArray = new SealedType[1_000_000];
private NonSealedType _nonSealedInstance = new();
private NonSealedType[] _nonSealedArray = new NonSealedType[1_000_000];

[Benchmark(Baseline = true)]
public void NonSealed() { /* ... */ }

[Benchmark]
public void Sealed() { /* ... */ }

public class NonSealedType { }
public sealed class SealedType { }
```

And the corresponding assembly, where the non-sealed version has to guard against an array-type mismatch while the sealed version does not:

```
; Program.NonSealed()
       sub       rsp,28
       mov       rax,[rcx+10]
       test      rax,rax
       je        short M00_L01
       mov       rcx,offset MT_Program+NonSealedType[]
       cmp       [rax],rcx
       jne       short M00_L02
       lea       rcx,[rax+10]
       mov       r8d,[rax+8]
M00_L00:
       mov       [rdx],rcx
       mov       [rdx+8],r8d
       mov       rax,rdx
       add       rsp,28
       ret
M00_L01:
       xor       ecx,ecx
       xor       r8d,r8d
       jmp       short M00_L00
M00_L02:
       call      System.ThrowHelper.ThrowArrayTypeMismatchException()
       int       3
; Total bytes of code 64

; Program.Sealed()
       mov       rax,[rcx+8]
       test      rax,rax
       je        short M00_L01
       lea       rcx,[rax+10]
       mov       r8d,[rax+8]
M00_L00:
       mov       [rdx],rcx
       mov       [rdx+8],r8d
       mov       rax,rdx
       ret
M00_L01:
       xor       ecx,ecx
       xor       r8d,r8d
       jmp       short M00_L00
; Total bytes of code 35
```

Separately, there were a bunch of places in the dotnet/runtime codebase still manually performing operations for which optimized helpers already existed; using those helpers instead, in addition to eliminating duplicated code, nicely simplified some complicated code while making it faster.
The only reason the culture is required for such parsing is to be able to parse a positive or negative sign symbol, but with these styles set, the implementation won't actually use those symbol values, and thus the actual culture utilized doesn't matter. Parsing also improved for hexadecimal inputs, thanks to dotnet/runtime#52470, which changed an internal routine used to determine whether a character is valid hex, making it branchless (which makes its performance consistent regardless of inputs or branch prediction) and removing the dependency on a lookup table. Then there's dotnet/runtime#51190, which recognizes that, at a very low level, when extending a 32-bit value in a 64-bit process to be native word size, it's ever so slightly more efficient from a codegen perspective to zero-extend rather than sign-extend; if the code in question is on a path where those are identical (i.e. we know by construction we don't have negative values), on a really hot path it can be beneficial to make that change. Along with the new and improved support for interpolated strings, a lot of cleanup across dotnet/runtime was also done with regards to string formatting. dotnet/runtime#50267, dotnet/runtime#55738, dotnet/runtime#44765, dotnet/runtime#44746, and dotnet/runtime#55831 all updated code to use better mechanisms; dotnet/runtime#commits/91f39e alone updated over 3000 lines of string-formatting-related code. Some of these changes are to use string interpolation where it wasn't used before due to knowledge of the performance implications. For example, there's code that formats the current process ID into a string; the culture would only be needed if the number were negative and required looking up the negative-sign symbol for the relevant culture, but here we know that process IDs are never negative, making the culture irrelevant.
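The culture-independence point can be made concrete with a small sketch (an illustration of the idea, not the library's internal code): digits-only input can be parsed with an explicit style and the invariant culture, since no sign symbol will ever be consulted.

```csharp
using System.Globalization;

public static class InvariantParseExample
{
    public static int ParseNonNegative(string text) =>
        // NumberStyles.None permits only decimal digits, so the culture's
        // positive/negative sign symbols are never consulted; passing the
        // invariant culture also avoids a CultureInfo.CurrentCulture lookup.
        int.Parse(text, NumberStyles.None, CultureInfo.InvariantCulture);
}
```

Calling `ParseNonNegative("733")` yields 733 regardless of the thread's current culture, which is exactly why the hot paths in the libraries can skip the culture machinery for values known to be non-negative.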
.NET 6 also improved the performance of many aspects of string formatting across the libraries, including changing a bunch of call sites from using a variable format to instead using a constant, the latter of which the JIT can better optimize. In some places, code was forcing the system to parse a composite format string and incur all the associated overhead, when at the end of the day the operation can be a simple concat, which the C# compiler will employ for an interpolated string when all the parts are strings and there are sufficiently few of them to use one of the non-params-array overloads. As a bonus, in cases like the process-ID formatting above, the implementation casts the known non-negative value so that no sign handling is needed. Then there are the plethora of changes that remove an array or boxing allocation here, an unnecessary LINQ query there, and so on. Moving on to System.Text.Json: .NET 5 delivered an enhanced version of the library, providing new APIs and even better performance, and .NET 6 continues the trend. But arguably the biggest performance improvement is in how the serializer obtains its metadata. JsonSerializer needs information about the types it's serializing in order to know what to serialize and how to serialize it. It retrieves that data via reflection, examining for example what properties are exposed on a type and whether there are any customization attributes applied. But reflection is relatively expensive, and certainly not something you'd want to do every time you serialize an instance of a type, so JsonSerializer caches that information. That cached information may include, for example, delegates used to access the properties on an instance in order to retrieve the data that needs to be serialized. Depending on how the serializer is configured, such a delegate might use reflection to invoke the property, or, if the system permits it, it might point to specialized code emitted via reflection emit. Unfortunately, both of those techniques have potential downsides.
Gathering all of this data, and potentially doing this reflection-emit work, at run-time has a cost that can measurably impact both the startup performance and the working set of an application. It also leads to increased size, as all of the code necessary to enable this (including support for reflection emit itself) needs to be kept around just in case the serializer needs it. Enter source generators. Generating source during a build is nothing new; such techniques have been used in and out of the .NET ecosystem for years. What is new, however, is the C# compiler making the capability a first-class feature, and core libraries in .NET 6 taking advantage of it. Just as the compiler allows analyzers to be plugged into a build to add custom analysis as part of the compiler's execution (with the compiler giving the analyzer access to all of the syntactic and semantic data it gathers and creates), the compiler now also enables a source generator to access the same information and then spit out additional C# code that's incorporated into the same compilation unit. This makes it very attractive for doing certain operations at compile-time that code may previously have been doing via reflection and reflection emit at run-time, like analyzing types as part of a serializer in order to generate fast member accessors. Here's the example in question:

```csharp
using System.Text.Json.Serialization;

namespace JsonExample;

class Program
{
    // ...
}

internal class BlogPost
{
    // ...
}

[JsonSerializable(typeof(BlogPost))]
internal partial class MyJsonContext : JsonSerializerContext
{
}
```

What's interesting is what happens behind the scenes. If you look again at Solution Explorer, you'll see a bunch of code the JSON source generator output: those files essentially contain all of the glue code that reflection and reflection emit would otherwise have generated at run-time. JsonSerializer is then able to use these generated delegates just as it's able to use ones that use reflection or that were generated via reflection emit.
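Assuming a context like the one above, serialization can then be routed explicitly through the generated metadata; `MyJsonContext` and `BlogPost` are the illustrative names from that snippet:

```csharp
using System.Text.Json;

// Routes serialization through the source-generated metadata for BlogPost,
// bypassing run-time reflection-based metadata gathering entirely.
BlogPost post = new BlogPost();
string json = JsonSerializer.Serialize(post, MyJsonContext.Default.BlogPost);
```

Passing the generated `JsonTypeInfo` is what unlocks the savings: no reflection on first use, no reflection-emit machinery kept alive, and (when enabled) the generated fast-path serialization code is used implicitly.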
As long as the source generator is spitting out all this code to do at compile time what was previously done at run time, it can take things a step further: dotnet/runtime#51149, dotnet/runtime#51300, and dotnet/runtime#51528 introduce a generated "fast path," essentially the serialization code you might write by hand if you were writing your own serializer customized specifically for your type. To measure it, I wrote a benchmark with a BlogPost field, a Utf8JsonWriter over Stream.Null, and three methods: JsonSerializerWithoutFastPath (the baseline), JsonSerializerWithFastPath, and DirectFastPath. The impact of these improvements can be quite meaningful. For example, the .NET implementation of the TechEmpower caching benchmark was changed to use the JSON source generator; previously, a significant portion of the time in that benchmark was spent doing JSON serialization, making it a prime candidate, and with the changes to use the source generator and benefit from the fast path implicitly being used, the benchmark gets ~30% faster. The announcement post for the System.Text.Json source generator provides a lot more detail and background. Also new in .NET 6 is another source generator, this time one related to interop. In .NET, C# code can call out to native C functions via the P/Invoke (Platform Invoke) mechanism. Some signatures are referred to as being "blittable," which really just means the runtime doesn't need to do any special transformation or marshaling as part of the call; when a signature isn't blittable, the runtime has to generate a marshaling stub at run time, which has multiple downsides, including the startup impact of having to do that code generation on first use. For .NET 7 we plan to enable a source generator to do it, and the groundwork has been laid in .NET 6. While the P/Invoke source generator won't ship as part of .NET 6, various investments made as part of that prototype will, such as changing several commonly marshaled types to be blittable.
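A BenchmarkDotNet harness along the lines described above might be shaped like this (a sketch, assuming the BlogPost and MyJsonContext definitions shown earlier; the DirectFastPath variant is omitted because the generated fast-path delegate isn't directly invocable through public .NET 6 API):

```csharp
using System.IO;
using System.Text.Json;
using BenchmarkDotNet.Attributes;

public class SerializeBenchmarks
{
    private readonly Utf8JsonWriter _writer = new(Stream.Null);
    private readonly BlogPost _blogPost = new() { Title = "Hello" };

    [Benchmark(Baseline = true)]
    public void JsonSerializerWithoutFastPath()
    {
        _writer.Reset();
        // Reflection-based metadata: per-property delegates, option lookups.
        JsonSerializer.Serialize(_writer, _blogPost);
    }

    [Benchmark]
    public void JsonSerializerWithFastPath()
    {
        _writer.Reset();
        // Source-generated context: the fast path writes properties directly.
        JsonSerializer.Serialize(_writer, _blogPost, MyJsonContext.Default.BlogPost);
    }
}
```

Writing to Stream.Null keeps I/O out of the measurement so the benchmark isolates the serialization cost itself.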
Another area in which we've seen cross-cutting improvements in .NET 6 is the use of function pointers to simplify and streamline interop. C# 9 added support for function pointers via the delegate* syntax. A good example of their use is the new PosixSignalRegistration type, implemented in dotnet/runtime#54136 from @tmds, dotnet/runtime#55333, and dotnet/runtime#55552 to enable code to register a callback to handle a POSIX signal. Both the Unix and Windows implementations of this type need to hand off to native code a callback to be invoked when a signal is received; on Unix, the native function that's called to register the callback expects a function pointer it can invoke. Thankfully, on the managed side we want to hand off a static method, so we don't need to get bogged down in the details of how we'd pass an instance method, keep the relevant state rooted, and so on; we just pass down the method's address. This ends up being simpler and more efficient, and multiple PRs in .NET 6 converted delegate-based interop to function pointer-based interop.

Shifting gears: in this blog post, when I've talked about runtime improvements in components like "the" JIT and GC, I've actually been referring to coreclr. All .NET apps, whether console apps, ASP.NET Core, Blazor WASM, or mobile apps, come from the same source in dotnet/runtime, but there are actually two runtime implementations in dotnet/runtime: "coreclr" and "mono". Blazor WebAssembly relies on mono, which has been honed over the years to be small and agile for these kinds of scenarios, and which has also received a lot of performance investment in .NET 6. There are three significant areas of investment here; the first is around improvements to the IL interpreter in mono.
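The static-method handoff described above can be sketched like this (hypothetical names; not the actual PosixSignalRegistration internals, just the general pattern, which requires unsafe code):

```csharp
using System;
using System.Runtime.InteropServices;

static unsafe class SignalInterop
{
    // Because the callback is static and attributed [UnmanagedCallersOnly],
    // native code can invoke it directly: no delegate allocation, no
    // marshaling stub, and no object the GC must keep rooted.
    [UnmanagedCallersOnly]
    private static void OnSignal(int signo) =>
        Console.WriteLine($"signal {signo}");

    public static void Demo()
    {
        // The address can be handed to a native registration routine
        // (via P/Invoke) as a plain C function pointer.
        delegate* unmanaged<int, void> handler = &OnSignal;
        handler(2); // invoked directly here purely for illustration
    }
}
```

Taking the address of an [UnmanagedCallersOnly] method with `&` is only legal for static methods, which is exactly why the managed side prefers them here.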
dotnet/runtime#43793 and dotnet/runtime#43514 converted a bunch of interop on both Windows and Unix to use function pointers. dotnet/runtime#54636 and dotnet/runtime#54884 did the same as part of moving time retrieval to a fully managed implementation rather than using "FCalls" into the runtime to get the current time, and in doing so used function pointers to store a pointer to the desired native OS function for getting the current time.

Back to Blazor WASM: the runtime is itself compiled to WASM, downloaded to the browser, and used to execute the application and library code on which the app depends. Mono not only has a JIT capable of on-demand assembly generation ala coreclr, it also supports interpreting IL, which is valuable on platforms that for security reasons prohibit executing machine code generated on the fly. dotnet/runtime#46037 overhauled the interpreter to move it from being stack-based (where IL instructions push and pop values from a stack) to being based on the concept of reading and writing local variables, a switch that both simplified the code base and gave it a performance boost.

Size matters a great deal here as well. The publish output highlights the relevant steps: "Optimizing assemblies for size, which may change the behavior of the app. See: https://aka.ms/dotnet-illink" and "Compiling native assets with emcc." The settings involved are feature switches, and they serve two purposes. First, they can be queried by code in the app (and, in particular, in the core libraries) to determine what functionality to employ; for example, if you search dotnet/runtime for "GlobalizationMode.Invariant", you'll find code along the lines of an if-block guarding ICU usage. Second, the trimmer can treat a switch as a constant and remove the guarded block, and if that ends up meaning additional types and members become unused, they can be removed as well. By setting the two aforementioned switches, we're eliminating any need the app has for the ICU globalization library, which is a significant portion of the app's size, removing another 750KB from the app, more than 40% of where we were before.
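The globalization switch is surfaced to project files as an MSBuild property; a fragment like the following (the property name is the real .NET SDK one, the surrounding project assumed to be a Blazor WASM csproj) opts the app into invariant mode so ICU can be trimmed:

```xml
<PropertyGroup>
  <!-- Feature switch: use invariant culture data only, allowing the ICU
       library and its data to be trimmed from the published app. -->
  <InvariantGlobalization>true</InvariantGlobalization>
</PropertyGroup>
```

At publish time the trimmer then sees GlobalizationMode.Invariant as a constant true and removes the ICU-dependent branches entirely.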
dotnet/runtime#52090 converted the macOS implementation of that time support as well, and related PRs moved allocation-related implementations out of native code in the runtimes up into C#. In doing so, a fair amount of overhead was removed, in particular on Unix, where a layer of wrappers was removed, as is evident from benchmark runs on Ubuntu. The new NativeMemory class (dotnet/runtime#54006) provides fast APIs for allocating, reallocating, and freeing native memory, with options including requiring that the memory have a particular alignment or that it be forcibly zeroed out (note that the Ubuntu numbers and the Windows numbers referenced here were taken on different machines and are not directly comparable).

On the mono interpreter front, dotnet/runtime#48513 improved the interpreter's ability to inline, in particular for methods attributed with [MethodImpl(MethodImplOptions.AggressiveInlining)], in places it's been measured to yield impactful gains. dotnet/runtime#50361, dotnet/runtime#51273, dotnet/runtime#52130, and dotnet/runtime#52242 all served to optimize how various kinds of instructions were encoded and invoked, and dotnet/runtime#51309 improved the efficiency of finally blocks by removing overhead associated with thread aborts, which no longer exist in modern .NET (.NET Framework 4.8 and earlier have the concept of a thread abort, where one thread can inject a special exception into another, and that exception can end up being thrown at practically any instruction; by default, however, aborts don't interrupt finally blocks).

Ok, so we've got our Blazor WASM app, and we're able to ship a small package down to the browser to execute it. On hardware intrinsics: .NET 5 added literally thousands of new methods, each of which maps effectively 1:1 with some hardware-specific instruction, enabling C# code to directly target functionality from various ISAs (Instruction Set Architectures) like SSSE3 or AVX2.
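The NativeMemory APIs mentioned above look like this in use (a minimal sketch; the byte counts are arbitrary, the method names are the real .NET 6 ones):

```csharp
using System.Runtime.InteropServices;

unsafe
{
    // Plain allocation, analogous to malloc/free.
    byte* p = (byte*)NativeMemory.Alloc(256);
    NativeMemory.Free(p);

    // Zeroed allocation, analogous to calloc(elementCount, elementSize).
    void* z = NativeMemory.AllocZeroed(16, 16);
    NativeMemory.Free(z);

    // Aligned allocation; must be released with AlignedFree.
    void* a = NativeMemory.AlignedAlloc(1024, 64);
    NativeMemory.AlignedFree(a);
}
```

Unlike Marshal.AllocHGlobal, these map closely to the underlying C allocator, making intent explicit and avoiding legacy HGlobal semantics.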
With ETW, you start tracing and the relevant data is written by the operating system to a log for subsequent analysis with a tool like Visual Studio, PerfView, or Windows Performance Analyzer. Alternatively, an in-process EventListener can observe events as they happen. For example, hooking up a listener and issuing await hc.GetStringAsync("https://dotnet.microsoft.com/") with an HttpClient yields output along these lines:

[2021-08-06T.4758871Z] RequestStart: scheme=https, host=dotnet.microsoft.com, port=443, pathAndQuery=/, versionMajor=1, versionMinor=1, versionPolicy=0
[2021-08-06T.5981990Z] ConnectionEstablished: versionMajor=1, versionMinor=1
[2021-08-06T.5995700Z] RequestLeftQueue: timeOnQueueMilliseconds=86.1312, versionMajor=1, versionMinor=1
[2021-08-06T.6011745Z] RequestHeadersStart
[2021-08-06T.6019475Z] RequestHeadersStop
[2021-08-06T.7591555Z] ResponseHeadersStart
[2021-08-06T.7628194Z] ResponseHeadersStop
[2021-08-06T.7648776Z] ResponseContentStart
[2021-08-06T.7665603Z] ResponseContentStop
[2021-08-06T.7667290Z] RequestStop
[2021-08-06T.7684536Z] ConnectionClosed: versionMajor=1, versionMinor=1

An EventListener like this is very flexible and allows for a myriad of interesting uses, so we still want to drive down the overhead associated with all of these callbacks.

The second area of mono investment was around hardware intrinsics. Of course, something needs to be able to translate the C# intrinsic methods into the underlying instructions they represent, which means a lot of work to fully enable every code generator.
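A minimal listener along the lines of the HttpConsoleListener used above could look like this (the event-source name "System.Net.Http" is the real one; the output formatting is simplified relative to the log shown):

```csharp
using System;
using System.Diagnostics.Tracing;

sealed class HttpConsoleListener : EventListener
{
    protected override void OnEventSourceCreated(EventSource eventSource)
    {
        // HttpClient publishes its telemetry events under this source name.
        if (eventSource.Name == "System.Net.Http")
            EnableEvents(eventSource, EventLevel.LogAlways);
    }

    protected override void OnEventWritten(EventWrittenEventArgs eventData) =>
        Console.WriteLine($"[{DateTime.UtcNow:O}] {eventData.EventName}");
}
```

Because OnEventWritten fires synchronously on the thread raising the event, keeping this callback cheap matters, which is exactly why the per-callback overhead reductions below are valuable.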
The most general of these APIs lets you trace arbitrary data, whether headed to a file or to an out-of-process consumer such as a tool like dotnet-counters, and dotnet/runtime#54925 reduced the overhead of using it, specifically when the data is being consumed by ETW, which has dedicated code paths in the implementation; the PR reduced allocation by 3-4x for basic use cases by avoiding multiple temporary buffers. dotnet/runtime#44026, dotnet/runtime#51822, dotnet/runtime#52092, and dotnet/runtime#52455 all contributed here as well.

There are many things that impact how long it takes an application to start up. For managed code executed prior to an application's main logic, .NET has technology like tiered JIT compilation and ReadyToRun. But beyond both of those, the .NET runtime is "just" a library that can be hosted inside of a larger application, the "host"; that host calls into various APIs that initialize the runtime and invoke static methods, and what that host does or does not do can have a significant impact on the startup performance of the app. Investments were made in .NET 6 to reduce that host overhead; one of the most expensive things a host can do is file I/O, especially if there's a lot of it. On the size front, here's an example of the kinds of changes applications and libraries can make to reduce their footprint: when we trim the app as part of publishing, we're only trimming the managed assemblies, not the runtime, as the SDK by itself doesn't include the tools necessary to do so.

Mono supports using LLVM for code generation, and a bunch of PRs improved LLVM-enabled mono's support for hardware intrinsics, whether it be dotnet/runtime#49260, dotnet/runtime#49737, dotnet/runtime#48361, and dotnet/runtime#47482 adding support for ARM64 AdvSimd APIs; dotnet/runtime#48413, dotnet/runtime#47337, and dotnet/runtime#48525 rounding out the support for the Sha1, Sha256, and Aes intrinsics; or dotnet/runtime#54924 and dotnet/runtime#47028 implementing foundational support.
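As a sketch of what those 1:1 intrinsic APIs look like from C# (summing bytes with ARM64 AdvSimd where available, with a scalar fallback; the helper class name is hypothetical):

```csharp
using System;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.Arm;

static class ByteSum
{
    public static int Sum(ReadOnlySpan<byte> data)
    {
        int total = 0;
        int i = 0;
        if (AdvSimd.IsSupported)
        {
            // Process 8 bytes at a time; each call maps ~1:1 to an instruction.
            while (i + 8 <= data.Length)
            {
                Vector64<byte> v = Vector64.Create(
                    data[i], data[i + 1], data[i + 2], data[i + 3],
                    data[i + 4], data[i + 5], data[i + 6], data[i + 7]);
                // AddAcrossWidening sums all lanes into a widened scalar (UADDLV).
                total += AdvSimd.AddAcrossWidening(v).ToScalar();
                i += 8;
            }
        }
        for (; i < data.Length; i++) total += data[i]; // scalar tail / fallback
        return total;
    }
}
```

The IsSupported check is evaluated as a constant by the code generator, so on hardware without AdvSimd the vector branch is dropped entirely, which is precisely the translation work the mono/LLVM PRs above enable.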
Given the high rate at which tracing data can be generated, it's important that this mechanism be as low-overhead as possible. On startup, dotnet/runtime#50671 reduces startup time by avoiding the file existence checks that were being performed for each file listed in the app's deps file as part of builds; that file contained additional probing paths that weren't actually necessary and were causing the host to probe more than necessary, negating the wins from the previous PR.

For Blazor WASM, we can rectify the untrimmed-runtime issue by installing the relevant SDK workload and running dotnet publish -c Release again. Many of the library performance improvements highlighted in previous blog posts rely on the throughput improvements from vectorization, and those accrue here as well, including when building Blazor WASM apps with AOT. The code in .NET 6 is exactly the same and is implemented in C#, making it a reasonable test case; a ~40% improvement there is due to the interpreter optimizations, as we're otherwise running the exact same code. I then modify the project to enable AOT and republish (and get a cup of coffee… AOT compilation takes a while). With that, I now get results of ~0.018 seconds, making it ~16x faster than it was before.

Whether it be the improvements in dotnet/runtime#49988 from @kronic, dotnet/runtime#54344, dotnet/runtime#54299, dotnet/runtime#54346, dotnet/runtime#54356, and dotnet/runtime#54836, or caching (dotnet/runtime#51761, dotnet/runtime#45410, dotnet/runtime#45563, dotnet/runtime#45280), or dotnet/runtime#44691 and dotnet/runtime#43365 from @am11, or any number of other areas, there have been an untold number of performance improvements in .NET 6 that I haven't been able to do justice to in this post. Of the ~400 merged PRs linked to in this blog post, over 15% came from the .NET community outside of Microsoft, and we'd love to see that number grow even higher. 🙂
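The project modification alluded to above ("I modify the project… then I republish") boils down to a single MSBuild property in current tooling; a csproj fragment (the property name is the real .NET 6 one, the workload command shown in a comment):

```xml
<PropertyGroup>
  <!-- Compile the app's IL ahead of time to WebAssembly at publish time.
       Requires the tooling installed via: dotnet workload install wasm-tools -->
  <RunAOTCompilation>true</RunAOTCompilation>
</PropertyGroup>
```

AOT only kicks in for publish (not inner-loop builds), which is why the effect shows up after `dotnet publish -c Release` rather than during day-to-day development.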
dotnet/runtime#50797 changed how access to buffers in EventPipe is synchronized, leading to significant increases in event throughput on all operating systems. Another example: if either ETW or EventPipe was being used to consume events, the implementation would P/Invoke into native code for each, but if only one of them was being used, that would lead to an unnecessary P/Invoke; the PR addressed this simply by checking whether the P/Invoke is necessary based on the known state of the consumers. On the hosting side, dotnet/runtime#53631 also helped reduce overheads by removing unnecessary string copies in the hosting layer, shaving milliseconds off execution time; for the "Hello, World" app I measured, the host changes highlight an ~30% reduction in execution cost, and all told this adds up to sizeable reductions in app startup.

And that brings us to the third, and arguably most impactful, area of mono investment: AOT for Blazor WASM. To measure it, I've replaced the entire contents of the Counter.razor file in both a .NET 5 project and a .NET 6 project with code that, in response to a button click, simply SHA-256 hashes a byte array of some UTF8 Shakespeare several thousand times.

In past posts on .NET performance, throughput and memory have been the primary two metrics on which I've focused. There are also many outstanding PRs in dotnet/runtime that haven't yet been merged but may be in a future release. For example, dotnet/runtime#57079 enables support for TLS resumption on Linux, which has the potential to improve the time it takes to establish a secure connection by an order of magnitude, and dotnet/runtime#55745 enables additional folding in the JIT. If you've got ideas for improvements or the inclination to try to make them a reality, please join us for a fun and fulfilling time in dotnet/runtime.
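The Counter.razor replacement described above boils down to a loop like this (a sketch; the iteration count and input text are placeholders, while SHA256.HashData is the real .NET 5+ one-shot hashing API):

```csharp
using System;
using System.Diagnostics;
using System.Security.Cryptography;
using System.Text;

// Hash a UTF8 byte[] of some Shakespeare several thousand times,
// timing how long the whole loop takes.
byte[] data = Encoding.UTF8.GetBytes("Shall I compare thee to a summer's day?");
var sw = Stopwatch.StartNew();
byte[] hash = Array.Empty<byte>();
for (int i = 0; i < 10_000; i++)
{
    hash = SHA256.HashData(data);
}
sw.Stop();
Console.WriteLine($"{sw.ElapsedMilliseconds}ms ({hash.Length}-byte digest)");
```

A tight, CPU-bound loop like this is a good proxy for the interpreter-versus-AOT gap, since every iteration executes the same managed code with no I/O to hide behind.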
I highlighted earlier that Blazor WASM apps targeting .NET 5 were interpreted: while the runtime itself was compiled to WASM, the runtime then turned around and interpreted the IL for the app and the libraries it depends on. In .NET 6, a Blazor WASM app can instead be compiled ahead of time entirely to WebAssembly, avoiding the need for JIT'ing or interpreting at run time. All of these improvements together lead to huge, cross-cutting performance improvements for Blazor WASM apps when targeting .NET 6.

Of late, another metric beyond throughput and memory has been getting a lot of attention: size, and in particular size-on-disk for a self-contained, trimmed application. That's primarily because of the Blazor WebAssembly (WASM) application model, where an entire .NET application, inclusive of the runtime, is downloaded to and executed in a browser. A lot of the effort in .NET 6, inclusive of changes in dotnet/runtime as well as in mono/linker (which provides the trimmer that analyzes and rewrites assemblies to remove, or "trim", or "tree shake", unused functionality), went into trimming safety, making it possible for any of the core libraries to be used in a trimmed application such that either everything that's needed will be correctly kept or the trimmer will produce warnings about what's wrong and how the developer can fix it. However, there was also a sizable effort (pun intended, I'm so funny) on the size reduction itself. To start, let's take a look at what size looked like for a new .NET 5 Blazor WASM application. It works, nice. Now, I can publish it, which will create and trim the whole application and produce all the relevant assets ready for pushing to my server, including Brotli-compressing all the required components ("Optimizing assemblies for size, which may change the behavior of the app. See: https://aka.ms/dotnet-illink" followed by "Compressing Blazor WebAssembly publish artifacts"). So much goodness already merged, and so much more on the way.
dotnet publish -c Release
Microsoft (R) Build Engine version 17.0.0-preview-21411-06+b0bb46ab8 for .NET
