Thanks for writing this. It does a good job of highlighting the importance of memory layout and how much idiomatic code in languages like Java and C# runs into problems with that. When reading this I couldn't help but thing of a paper I read back in grad school (http://www.cc.gatech.edu/~harrold/6340/cs6340_fall2009/Readings/choi99escape.pdf) on escape analysis in Java. The goal of this analysis was two fold. First, to allow stack allocation for anything that doesn't escape the current method. Second, remove synchornization for anything that doesn't escape the current thread. You only briefly touch on this type of thing at the end and pretty much dismiss it. This was written back in 1999, so there has been a lot of time to try to get this type of thing into compilers. It seems to me that using this to enforce stack allocation in the compiler could theoretically address the issue of most code allocating small objects. Do you have any idea why that hasn't happened? I haven't followed program analysis much since about 2001, so I'm wondering if this line of word ran into some fundamental problem that they couldn't get around.
"If you’re spending 99.9% of your time waiting on the network..." your program is scarcely running at all. Make the code multithreaded and increase the load. Then you will discover what the program is actually spending time on. Accessing memory dominates the cost of running a program on PC-class and handset-class hardware. Calling into the memory manager is expensive precisely because it makes scattered access to memory that defeats caching. No programming language will be efficient unless it encourages stack-based or static allocation of objects. Doesn't matter how tricksy the GC is. Anyone who says different is selling you something.
@markclewis The reason I dismiss escape analysis is that it's so limited in scope. It sometimes helps, and that's great, but you can't enforce it. Some innocent-looking change in the future might fall on the wrong side of some opaque heuristic in the compiler and all of a sudden you're allocating a gigabyte per second. That's no way to get predictable and fast code. The other problem with these "automatic" systems is that this is fundamentally a programmer-in-the-loop problem. The reason C++ programs don't allocate as much is because the programmer makes *different* choices about how to manage objects, than a programmer in C#. Most of the time these choices have repercussions that a compiler wouldn't be allowed to introduce, but when a programmer does it it's fine because the programmer knows what they're doing. So I think fundamentally this stuff has to be a language feature, not a compiler feature. Having escape analysis is great, and can give you some nice freebies, but it's the nature of automated analysis like that to be limited in what it can achieve because the p rogrammer is never told "oooh, you can *almost* allocate this object on the stack, but *this* thing over here is preventing it".
Great post. Could you clarify this sentence: "C# typically encourages you to allocate each part of that data in a separate heap allocation". Separate heap allocation? What do you mean?
Jeffrey Richter says in CLR via C#: "Because the managed heap allocates these objects next to each other in memory, you get excellent performance when accessing these objects due to locality of reference. Specifically, this means that your process’s working set is small, which means your application runs fast with less memory. It’s also likely that the objects your code is accessing can all reside in the CPU’s cache. The result is that your application will access these objects with phenomenal speed because the CPU will be able to perform most of its manipulations without having cache misses that would force slower access to RAM." He seems to say the opposite...
While I disagree with a few statements in the article (some aren't quite correct), I agree with the gist of the article. And I certainly agree with the statements regarding the unnecessary allocations in the BCL and C# class design. It's certainly possible to avoid boxing an int or float to print it into a StringBuilder, but what you can't avoid is the fact that appending it to a StringBuilder will allocate a temporary string (immediate garbage). And in order for that string to be created, it needs to create a temporary StringBuilder (which I believe they at least pool). There's not even an exposed way to concatenate two StringBuilders together without converting one to a string first. :( “Because the managed heap allocates these objects next to each other in memory, you get excellent performance when accessing these objects due to locality of reference." ... "Specifically, this means that your process’s working set is small, which means your application runs fast with less memory. " This is true only if you're accessing the objects in roughly the same order that they were allocated and that you allocated all objects together (sequential in memory). The main 'memory' benefit of the GC in C#/.NET is that it compacts and will reduce heap fragmentation. However, the second statement above is certainly not always correct. The GC runs only after a certain threshold of allocations is reached. During this time the GC may waste up to 2x or more memory than its lowest usage. Newly allocated objects will also likely be scattered throughout this memory until a GC is run and temporary (garbage) objects are removed and the memory compacted. Compacting memory also takes time and the .NET GC must pause the application during this time as objects are moved around in memory. As a test, I mocked up an example in C#/.NET Core to simulate removing, adding, and updating "Entities" in a list per frame. At 10k entities in the list, I am seeing 1ms GC pause times ~ 10x what a normal "frame" runs at. At 100k entities in the list, the pause times go up to 18-20ms which is ~5x slower than a normal "frame". Basically more than a single frame's entire time at 60fps. Now I have no dreams of running 100k entities, but the 10k estimate is most likely grossly understated as my entity does not contain additional references/allocations that must be visited during the mark. For comparison, the same 100k test in C++ runs 1.8x faster than the fastest C# frames and suffers no GC pauses. I would love to see an iterative GC with *low* pause times (1ms or less) for C#, but it does not look like it will happen in the near future.