Stuff Gil Says: What sort of allocation rates can servers handle?

First, a side note: This blog entry is a [nearly verbatim] copy of a posting I made on the Mechanical Sympathy Google Group. I'm lazy. But I recycle. So I think of that as a net positive.

The discussion in question actually started from a question about what a good choice of hardware for running a low latency application may looks like these days, but then evolved into other subjects (as many of the best discussions on the group do), one of which was allocation rates.

Several smart and experienced people on the group chimed in and shared their hard earned wisdom, a lot of which came down to recommendations like "keep your allocation rates low", and "reducing allocation rates is one of the best tools to improve application behavior/performance". Specific numbers were cited (e.g. "My current threshold ... is ~300-400MB/sec").

This made that big "Java applications work hard to use less than 5% of today's toy server capabilities" chip I carry on my shoulder itch. I decided to scratch the itch by pointing out that one thing (and one thing only) is making people work hard to keep their apps within those silly limits: It's all about GC Pauses.

To support my claim, I went on a trivial math spree to show that even today's "toy" commodity servers can easily accommodate a rate of allocation 50x higher than the levels people try and contain their applications within, and that the only bad thing about a higher allocation rate is higher pause artifacts. In the poor systems that have those pause artifacts, of course....

The rest of the below is the actual posting:

...

These "keep allocation rates down to 640KB/sec" (oh, right, you said 300MB/sec) guidelines are purely driven by GC pausing behavior. Nothing else.

Kirk (and others) are absolutely right to look for such limits when pausing GCs are used. But the *only* thing that makes allocation rates a challenge in todays Java/.NET (and other GC based) systems is GC pauses. All else (various resources spent or wasted) falls away with simple mechanical sympathy math. Moore's law is alive and well (for now). And hardware-related sustainable allocation rate follows it nicely. 20+GB/sec is a very practical level on current systems when pauses are not an issue. And yes, that's 50x the level at which people seem to "tune" for by crippling their code or their engineers...

Here is some basic mechanical sympathy driven math about sustainable allocation rates (based mostly on Xeons):

1. From a speed and system resources spent perspective, sustainable allocation rate roughly follows Moore's law for the past 5 Xeon CPU generations.

1.1 From a CPU speed perspective:

The rate of sustainable allocation of a single core (at a given frequency) is growing very slowly over time (not @ Moore's law rates, but still creeping up with better speed at similar frequency, e.g. Haswell vs. Nehalem).

The number of cores per socket is growing nicely, and with it the overall overall CPU power per socket (@ roughly Moore's law). (e.g. from 4 cores per socket in late 2009 to 18 cores per socket in late 2014).

The overall CPU power available to sustain allocation rate per socket (and per 2 socket system, for example) is therefore growing at roughly Moore's law rates.

1.2 From a cache perspective:

L1 and L2 cache size per core have been fixed for the past 6 years in the Xeon world.

The L3 cache size per core is growing fairly slowly (not at Moore's law rates), but the L3 cache per socket has been growing slightly faster than number of cores per socket. (e.g. from 8MB/4_core_socket in 2009 to 45MB/18_core_socket in late 2014).

The cache size per socket has been growing steadily at Moore's law rates.

With the cache space per core growing slightly over time, the cache available for allocation work per core remains fixed or better.

1.3 From a memory bandwidth point of view:

The memory bandwidth per socket has been steadily growing, but at a rate slower than Moore's law. E.g. A late 2014 E5-2690 V3 has a max bandwidth of 68GB/sec. per socket. A late 2009 E5590 had 32GB/sec of max memory bandwidth per socket. That's a 2x increase over a period of time during which CPU capacity grew by more than 4x.

However, the memory bandwidth available (assume sustainable memory bandwidth is 1/3 or 1/2 of max), is still WAY up there, at 1.5GB-3GB/sec/core (that's out of a max of about 4-8GB/sec per core, depending on cores/socket chosen).

So while there is a looming bandwidth cap that may hit us in the future (bandwidth growing slower than CPU power), It's not until we reach allocation levels of ~1GB/sec/core that we'll start challenging memory bandwidth in current commodity server architectures.

From a memory bandwidth point of view, this translates to >20GB/sec of comfortably sustainable allocation rate on current commodity systems..

1.4 From a GC *work* perspective:

From a GC perspective, work per allocation unit is a constant that the user controls (with ratio or empty to live memory).

On Copying or Mark/Compact collectors, the work spent to collect a heap is linear to the live set size (NOT the heap size).

The frequency at which a collector has to do this work roughly follows:

allocation_rate / (heap_size - live_set_size)

The overall work per time unit is therefore follows allocation rate (for a given live_set_size and heap_size).

And the overall work per allocation unit is therefore a constant (for a given live_set_size and heap_size)

The constant is under the user's control. E.g. user can arbitrarily grow heap size to decrease work per unit, and arbitrarily shrink memory to go the other way (e.g. if they want to spend CPU power to save memory).

This math holds for all current newgen collectors, which tend to dominate the amount of work spent in GC (so not just in Zing, where it holds for both newgen and olden).

But using this math does require a willingness to grow the heap size with Moore's law, which people have refused to do for over a decade. [driven by the unwillingness to deal with the pausing effects that would grow with it]

[BTW, we find it to be common practice, on current applications and on current systems, to deal with 1-5GB/sec of allocation rate, and to confortably do so while spend no more than 2-5% of overall system CPU cycles on GC work. This level seems to be the point where most people stop caring enough to spend more memory on reducing CPU consumption.]

2. From a GC pause perspective:

This is the big bugaboo. The one that keeps people from applying all the nice math above. The one that keeps Java heaps and allocation rates today at the same levels they were 10 years ago. The one that seems to keep people doing "creative things" in order to keep leveraging Moore's law and having programs that are aware of more than 640MB of state.

GC pauses don't have to grow with Moore's law. They don't even have to exist. But as long as they do, and as long as their magnitude grows with the attempt to linearly grow state and allocation rates. Pauses will continue to dominate people's tuning and coding decisions and motivations. [and by magnitude, we're not talking about averages. We're talking about the worst thing people will accept during a day.]

GC pauses seem to be limiting both allocation rates and live set sizes.

The live set size part is semi-obvious: If your [eventual, inevitable, large] GC pause grows with the size of your live set or heap size, you'll cap your heap size at whatever size causes the largest pause you are willing to bear. Period.

The allocation rate part requires some more math, and this differs for different collector parts:

2.2 For the newgen parts of collector:

By definition, a higher allocation rate requires a linearly larger newgen sizes to maintain the same "objects die in newgen" properties. [e.g. if you put 4x as many cores to work doing the same transactions, with the same object lifetime profiles, you need 4x as much newgen to avoid promoting more things with larger pauses].

While "typical" newgen pauses may stay just as small, a larger newgen linearly grows the worst-case amount of stuff that a newgen *might* promote in a single GC pauses, and with it grows the actual newgen pause experienced when promotion spikes occur.

Unfortunately, real applications have those spikes every time you read in a bunch of long-lasting data in one shot (like updating a cache or a directory, or reading in a new index, or replicating state on a failover),

Latency sensitive apps tend to cap their newgen size to cap their newgen pause times, in turn capping their allocation rate.

2.3 For oldgen collectors:

Oldgen collectors that pause for *everything* (like ParallelGC) actually don't get worse with allocation rate. They are just so terrible to begin with (pausing for ~1 second per live GB) that outside of batch processing, nobody would consider using them for live sets larger than a couple of GB (unless they find regularly pausing for more than a couple of seconds acceptable).

Most Oldgen collectors that *try* to not pause "most" of the time (like CMS) are highly susceptible to allocation rate and mutation rate (and mutation rate tends to track allocation rate linearly in most apps). E.g. the mostly-concurrent-marking algorithms used in CMS and G1 must revisit (CMS) or otherwise process (G1's SATB) all references mutated in the heap before it finishes. The rate of mutation increases the GC cycle time, while at the same time the rate of allocation reduces the time the GC has in order to complete it's work. At a high enough allocation rate + mutation rate level, the collector can't finish it's work fast enough and a promotion failure or a concurrent mode failure occurs. And when that occurs, you get that terrible pause you were trying to avoid.

As a result, even for apps that don't try to maintain "low latency" and only go for "keep the humans happy" levels, most current mostly-concurrently collectors only remain mostly-concurrent within a limited allocation rate. Which is why I suspect these 640KB/sec (oh, right, 300MB/sec) guidelines exist.

Bottom line:

When pauses are not there to worry about, sustaining many GB/sec of allocation is a walk in the park on today's cheap, commodity servers. It's pauses, and only pauses, that make people work so hard to fit their applications in a budget that is 50x smaller than what the hardware can accommodate. People that do this do it for good reason. But it's too bad they have to shoulder the responsibility for badly behaving garbage collectors. When they can choose (and there is a choice) to use collectors that don't pause, the pressure to keep allocation rates down changes, moving the "this is too much" lines up by more than an order of magnitude.

With less pauses comes less responsibility.

[ I need to go do real work now... ]

5 comments:

UnknownJanuary 26, 2015 at 11:54 PM
This comment has been removed by a blog administrator.
AnonymousMarch 9, 2015 at 12:44 AM
This comment has been removed by a blog administrator.
UnknownJune 2, 2015 at 6:42 AM
Hi, just got here after listening to Your brilliant "Understanding Java GC" presentation from 2012 (thank You a TON for compiling and sharing it!!)

As much as I'd like us to try Zing out - it'd take lots of time before it reaches production datacenter,
and for short-term - I don't think the miracle of plugglable-GCs exists in JVMs (to use C4 instead of CMS).

Since our throughput had grown - we're facing >100sec GC pauses, but RAM isn't over-utilized at all (live data is less than 5GB).
To mitigate the "length of GC pause exceeds human/software timeouts" - I'm planning to try a bit counter-intuitive approach:
- REDUCE the heap_size 10x times
- deploy 10x tomcat instances
- reduce live_set_size "almost 10x" times (balancer routes calls per user-hash, so we can reduce user-obj caches 10x, though static/config caches will remain same)

If understood Your math corrcetly - this will cause each instance to have:
- almost-same frequency of fullGCs (allocation_rate/10, heap_size/10, live_set_size/almost_10)
- 10x shorter duration of each fullGC (heap_size/10)
And overall CPU use should remain same (same load spread into smaller buckets).

I understand the approach is not that efficient in comparison to C4, but still effective up to some NNN-times scale (100x?), with following limitations:
1) at some scale the need to keep NNN copies of static/config caches (live set) should skew the "almost-NNNx" math into impractical
2) non-uniformness of user-data (large users) will start to suffer from walls of too-small-buckets

Just curious if any mainstream JVM is capable of releasing unused heap (above -Xms level) back to OS? (this would've addressed the #2 above)
thempiJuly 10, 2022 at 11:28 PM
v257s7fjijz677 cheap sex toys,cheap sex toys,dildo,sex chair,wolf dildo,sex chair,vibrators,realistic sex dolls,horse dildo u103x3edfcv959

Friday, October 31, 2014

What sort of allocation rates can servers handle?

5 comments: