Plenty of interesting details.
> Regarding cache-line size, sysctl on macOS reports a value of 128 B, while getconf and the CTR_EL0 register on Asahi Linux returns 64 B, which is also supported by our measurements.
How would this be even possible?
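If anyone wants to poke at it themselves, here is a minimal sketch of how to query the value each way. It assumes macOS's hw.cachelinesize sysctl key, glibc's _SC_LEVEL1_DCACHE_LINESIZE, and that the kernel permits userspace reads of CTR_EL0 (Linux does; DminLine is bits 19:16, in 4-byte words):

```c++
// Query the cache-line size three ways: the macOS sysctl key, the glibc
// sysconf key, and the raw AArch64 CTR_EL0 register (DminLine field).
#include <cstdint>
#include <cstdio>
#include <unistd.h>

#ifdef __APPLE__
#include <sys/sysctl.h>
#endif

int main() {
#ifdef __APPLE__
    // What `sysctl hw.cachelinesize` reports (128 on Apple Silicon per the paper).
    int64_t line = 0;
    size_t len = sizeof(line);
    if (sysctlbyname("hw.cachelinesize", &line, &len, nullptr, 0) == 0)
        std::printf("hw.cachelinesize: %lld B\n", (long long)line);
#else
    // What `getconf LEVEL1_DCACHE_LINESIZE` reports (64 on Asahi per the paper).
    long line = sysconf(_SC_LEVEL1_DCACHE_LINESIZE);
    std::printf("LEVEL1_DCACHE_LINESIZE: %ld B\n", line);
#endif

#if defined(__aarch64__)
    // CTR_EL0.DminLine (bits 19:16) is log2 of the smallest D-cache line,
    // counted in 4-byte words; works where userspace reads are allowed.
    uint64_t ctr;
    asm volatile("mrs %0, ctr_el0" : "=r"(ctr));
    std::printf("CTR_EL0 DminLine: %u B\n", 4u << ((ctr >> 16) & 0xF));
#endif
    return 0;
}
```

At least that makes it easy to check whether the disagreement lives in the reporting interfaces or in the hardware itself.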
Cool paper! The authors use the fact that the M1 chip supports both ARM's weaker memory consistency model and x86's total order to investigate the performance hit from using the latter, ceteris paribus.
They see an average of 10% degradation on SPEC and show some synthetic benchmarks with a 2x hit.
This raises questions.
For example, modern x86 architectures still readily out-perform ARM64 in performance-engineered contexts. I don’t think that is controversial. There are a lot of ways to explain it: e.g., x86 is significantly more efficient in some other unrelated areas, x86 code is tacitly designed to minimize the performance impact of TSO, or the Apple Silicon implementations nerf TSO because it isn’t worth the cost to optimize a compatibility shim. TSO must have some value in some contexts; it wasn’t chosen arbitrarily.
Apple Silicon is also an unconventional implementation of ARM64, so I wonder the extent to which this applies to any other ARM64 implementation. I’d like to see more thorough and diverse data. It feels like there are confounding factors.
I think it is great that this is being studied, I’m just not sure it is actionable without much better and more rigorous measurement across unrelated silicon microarchitectures.
The programs that see the most benefit from WO vs TSO are poorly written multithreaded programs. Most of the software you actually use might be higher quality than that?
> TSO must have some value in some contexts, it wasn’t chosen arbitrarily.
Ehhh. I think they might have just backed themselves into it? I believe Intel initially claimed SeqCst but the chips never implemented that and the lack was observable. TSO happened to accurately describe the existing behavior of early multicore Intel chips and they can't exactly relax it now without breaking existing binaries.
Google's AI slop claims Intel published something vague in 2007, and researchers at Cambridge came up with the TSO name and observation in 2009 ("A Better x86 Memory Model: x86-TSO").
https://www.cl.cam.ac.uk/~pes20/weakmemory/x86tso-paper.tpho...
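For anyone who hasn't seen that line of work, the canonical illustration is the store-buffering litmus test. Here is a rough C++ sketch (relaxed atomics so the compiler emits plain loads and stores; spawning threads per iteration is slow but keeps it self-contained):

```c++
// Store-buffering litmus test: each thread writes its own flag and then
// reads the other. Under TSO the stores can sit in the store buffers while
// the loads complete, so r1 == 0 && r2 == 0 is observable even on x86;
// weaker models allow further reorderings (load-load, store-store) on top.
#include <atomic>
#include <cstdio>
#include <thread>

std::atomic<int> x{0}, y{0};
int r1, r2;

int main() {
    int both_zero = 0;
    for (int i = 0; i < 100000; ++i) {
        x.store(0, std::memory_order_relaxed);
        y.store(0, std::memory_order_relaxed);

        std::thread t1([] {
            x.store(1, std::memory_order_relaxed);
            r1 = y.load(std::memory_order_relaxed);
        });
        std::thread t2([] {
            y.store(1, std::memory_order_relaxed);
            r2 = x.load(std::memory_order_relaxed);
        });
        t1.join();
        t2.join();

        if (r1 == 0 && r2 == 0)
            ++both_zero;  // the store->load reordering TSO permits
    }
    std::printf("r1 == 0 && r2 == 0 in %d of 100000 runs\n", both_zero);
    return 0;
}
```

The both-zero outcome is exactly the store-to-load reordering that x86-TSO permits; the reorderings it rules out are the ones plain AArch64 additionally allows and the TSO bit takes away again.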
The Apple M4 CPU is pretty much king in terms of single-threaded performance. In multithreaded workloads the M4 Ultra of course loses against extreme high-core-count server CPUs. But I think it's wrong to say that x86 readily outperforms ARM64. Apple essentially dominates in all CPU segments they are in.
But x86_64 does outperform ARM64 in high-performance workloads, and high-performance workloads are not single-threaded programs. Maybe that would change if Apple decided one day to manufacture a server CPU, which I believe they will not, since they would have to open their chips to Linux. OTOH, server aarch64 implementations such as Neoverse or Graviton are not as good as x86_64 in terms of absolute performance. Their core design cannot yet compete.
Yeah, and outside of benchmarks you also have to consider the power envelope and the platform on top of it, which is definitely in a class of its own.
This comment is a two-sentence summary of the six-sentence Abstract at the very top of the linked article. (Though the paper claims 9%, not 10%, to three sig figs, so rounding up to 10% is inappropriate.)
Also: 9% is huge! I am kind of skeptical of this result (haven't yet read the paper). E.g., is it possible the TSO mode on this ARM chip isn't as well optimized as on a natively TSO platform like x86, so it shows weaker relative performance than x86 would?
> An application can benefit from weak MCMs if it distributes its workload across multiple threads which then access the same memory. Less-optimal access patterns might result in heavy cache-line bouncing between cores. In a weak MCM, cores can reschedule their instructions more effectively to hide cache misses while stronger MCMs might have to stall more frequently.
So to some extent, this is avoidable overhead with better design (reduced mutable sharing between threads). The impact of TSO vs WO is greater for programs with more sharing.
> The 644.nab_s benchmark consists of parallel floating point calculations for molecular modeling. ... If not properly aligned, two cores still share the same cache-line as these chunks span over two instead of one cache-line. As shown in Fig. 5, the consequence is an enormous cache-line pressure where one cache-line is permanently bouncing between two cores. This high pressure can enforce stalls on architectures with stronger MCMs like TSO, that wait until a core can exclusively claim a cache-line for writing, while weaker memory models are able to reschedule instructions more effectively. Consequently, 644.nab_s performs 24 percent better under WO compared to TSO.
Yeah, ok, so the huge magnitude observed is due to some really poor program design.
> The primary performance advantage applications might gain from running under weaker memory ordering models like WO is due to greater instruction reordering capabilities. Therefore, the performance benefit vanishes if the hardware architecture cannot sufficiently reorder the instructions (e.g., due to data dependencies).
Read the thing all the way through. It's interesting and maybe useful for thinking about WO vs TSO mode on Apple M1 Ultra chips specifically, but I don't know how much it generalizes.
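To make the data-dependency point concrete, here is a rough sketch of the two access patterns being contrasted (illustration only; the data sizes are toy, and a real experiment would use large, shuffled working sets):

```c++
// Rough illustration of the dependency point: dependent loads serialize on
// every miss regardless of memory model, while independent loads leave the
// core free to overlap misses, which is where WO has more room than TSO.
#include <cstdio>
#include <vector>

struct Node { Node* next; long payload; };

// Dependent chain: each load's address comes from the previous load, so
// no amount of reordering can start the next miss early.
long chase(Node* head) {
    long sum = 0;
    for (Node* n = head; n != nullptr; n = n->next)
        sum += n->payload;
    return sum;
}

// Independent accesses: the addresses are known up front, so the core can
// keep several misses outstanding and reorder around them.
long gather(const std::vector<long>& data, const std::vector<size_t>& idx) {
    long sum = 0;
    for (size_t i : idx)
        sum += data[i];
    return sum;
}

int main() {
    // Tiny smoke test only.
    std::vector<Node> nodes(4);
    for (size_t i = 0; i + 1 < nodes.size(); ++i)
        nodes[i] = {&nodes[i + 1], (long)i};
    nodes.back() = {nullptr, (long)(nodes.size() - 1)};

    std::vector<long> data{10, 20, 30, 40};
    std::vector<size_t> idx{3, 0, 2, 1};

    std::printf("chase=%ld gather=%ld\n", chase(&nodes[0]), gather(data, idx));
    return 0;
}
```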
I’m not an expert… but it seems like it could be even simpler than program design. They note false sharing occurs because data isn’t cache-line aligned. Yet when compiling for ARM, that’s not a big deal due to WO. When targeting x86, you would hope the compiler would work hard to align it! So the out-of-the-box compiler behavior could be crucial. Are there extra flags that should be used when targeting ARM-TSO?
False sharing mostly needs to be avoided with program design. I'm not aware of any compiler flags that help here.
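Right, the usual fix lives in the data layout, not in a flag. A minimal sketch of the kind of change that matters (the 128 B constant is an assumption to cover the larger line size reported on Apple Silicon; C++17's std::hardware_destructive_interference_size is the portable spelling where the standard library provides it):

```c++
// False-sharing sketch: two threads bump their own counter. If the counters
// typically share a cache line, the line bounces between the cores on every
// write; padding each counter to its own line removes the contention,
// independent of TSO vs WO.
#include <atomic>
#include <cstddef>
#include <cstdio>
#include <thread>

// 128 B is assumed here to cover the larger line size reported on Apple
// Silicon; 64 B would do on most other machines.
constexpr std::size_t kLine = 128;

struct Shared {                       // both counters usually in one line: bounces
    std::atomic<long> a{0};
    std::atomic<long> b{0};
};

struct Padded {                       // one counter per line: no bouncing
    alignas(kLine) std::atomic<long> a{0};
    alignas(kLine) std::atomic<long> b{0};
};

template <typename Counters>
void hammer(Counters& c, long iters) {
    std::thread t1([&] { for (long i = 0; i < iters; ++i) c.a.fetch_add(1, std::memory_order_relaxed); });
    std::thread t2([&] { for (long i = 0; i < iters; ++i) c.b.fetch_add(1, std::memory_order_relaxed); });
    t1.join();
    t2.join();
}

int main() {
    constexpr long iters = 10'000'000;
    Shared s;
    Padded p;
    hammer(s, iters);                 // time this...
    hammer(p, iters);                 // ...against this to see the gap
    std::printf("shared: %ld %ld, padded: %ld %ld\n",
                s.a.load(), s.b.load(), p.a.load(), p.b.load());
    return 0;
}
```

Timing the shared run against the padded run is the cheap way to see whether false sharing is what's biting, on either memory model.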
I’ve seen the stronger x86 memory model argued as one of the things that affects its performance before.
It’s neat to see real numbers on it. The effect didn’t seem to be very big in many circumstances, which matches what I would have guessed.
Of course, Apple only just implemented TSO with the M1, while AMD/Intel have been doing it for a long time. I wonder if later M chips reduced the effect. And will they drop the feature once they drop Rosetta 2?
I'm really curious how exactly they'll wind up phasing out Rosetta 2. They seem to be a bit coy about it:
> Rosetta was designed to make the transition to Apple silicon easier, and we plan to make it available for the next two major macOS releases – through macOS 27 – as a general-purpose tool for Intel apps to help developers complete the migration of their apps. Beyond this timeframe, we will keep a subset of Rosetta functionality aimed at supporting older unmaintained gaming titles, that rely on Intel-based frameworks.
However, that leaves much unsaid. Unmaintained gaming titles? Does this mean native, old macOS games? I thought many of them were already no longer functional by this point. What about Crossover? What about Rosetta 2 inside Linux?
https://developer.apple.com/documentation/virtualization/run...
I wouldn't be surprised if they really do drop some x86 amenities from the SoC at the cost of performance, but I think it would be a bummer if they dropped the Rosetta 2 use cases that don't involve native apps. Those are useful. Rosetta 2 is faster than alternative recompilers. Maybe FEX will have bridged the gap most of the way by then?
They dropped Rosetta 1; what makes you think they will keep supporting this one?
> However, that leaves much unsaid. Unmaintained gaming titles? Does this mean native, old macOS games? I thought many of them were already no longer functional by this point. What about Crossover? What about Rosetta 2 inside Linux?
Apple keeps trying to be a platform for games. Keeping old games running would be a step in that direction. That might include support for x86 games running through Wine, the Apple Game Porting Toolkit, etc.