Very interesting and surprising findings by Google with respect to NUMA: http://highscalability.com/blog/2013/5/30/google-finds-numa-up-to-20-slower-for-gmail-and-websearch.html
It is curious that cache contention and NUMA interact so strongly depending on the workload. The key takeaway is in this paragraph:
“In conclusion, surprisingly, some running scenarios with more remote memory accesses may outperform scenarios with more local accesses due to an increased amount of cache contention for the latter, especially when 100% local accesses cannot be guaranteed. This tradeoff between NUMA and cache sharing/contention varies for different applications and when the application’s corunner changes. The tradeoff also depends on the remote access penalty and the impact of cache contention on a given machine platform. On our Intel Westmere, more often, NUMA has a more significant impact than cache contention. This may be due to the fact that this platform has a fairly large shared cache while the remote access latency is as large as 1.73x of local latency.”
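The tradeoff in that paragraph can be made concrete with a toy average-memory-access-time model. The 1.73x remote penalty is from the study; the latencies and miss rates below are assumed, illustrative numbers only:

```python
# Toy model of the NUMA vs. cache-contention tradeoff.
# REMOTE_PENALTY (1.73x) is from the study; all other numbers
# are assumed for illustration, not measured values.
LOCAL_DRAM_NS = 100.0   # assumed local DRAM access latency
REMOTE_PENALTY = 1.73   # remote/local latency ratio (from the study)
CACHE_HIT_NS = 10.0     # assumed shared-cache hit latency

def effective_latency(miss_rate, dram_ns):
    """Average access time: hits served by the shared cache, misses by DRAM."""
    return (1 - miss_rate) * CACHE_HIT_NS + miss_rate * dram_ns

# Local placement next to a cache-hungry co-runner: contention
# inflates the miss rate (assumed 40% vs. 20% on a quiet socket).
local_contended = effective_latency(0.40, LOCAL_DRAM_NS)
# Remote placement on a quiet socket: lower miss rate, but every
# miss pays the 1.73x remote DRAM latency.
remote_quiet = effective_latency(0.20, LOCAL_DRAM_NS * REMOTE_PENALTY)

print(round(local_contended, 1))  # → 46.0
print(round(remote_quiet, 1))    # → 42.6 (remote wins under these assumptions)
```

Under these made-up numbers the remote placement is faster, exactly the surprising outcome the study reports; flip the assumed miss rates and locality wins again, which is why the tradeoff is workload-dependent.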
These findings have implications for NUMA-aware thread schedulers in the OS. A scheduler would need to compute its NUMA policy parameters from the platform and load characteristics (e.g., from CPU performance counters). It may even be worth letting threads programmatically give NUMA policy hints to the scheduler; that is, a thread could declare whether cache sharing or freedom from cache contention matters more to it.
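Such a hint-driven placement decision might be sketched as follows. Every name and number here except the 1.73x ratio is hypothetical; this is not a real scheduler API, just an illustration of combining a per-thread hint with counter-derived per-socket stats:

```python
# Hypothetical sketch: a scheduler scoring sockets for a thread using a
# per-thread NUMA hint plus stats derived from CPU performance counters.
# All APIs, names, and weights are invented for illustration.
from dataclasses import dataclass

@dataclass
class ThreadHint:
    # Declared by the thread: values cache sharing with its peers
    # (contention-tolerant) vs. wanting to avoid cache contention.
    prefers_cache_sharing: bool

@dataclass
class SocketStats:
    llc_miss_rate: float      # shared-cache pressure, from counters (assumed)
    is_local_to_memory: bool  # thread's pages live on this socket's NUMA node

REMOTE_PENALTY = 1.73  # remote/local latency ratio from the study

def placement_score(hint, stats):
    """Lower is better: penalize remote memory and shared-cache pressure."""
    remote_cost = 0.0 if stats.is_local_to_memory else REMOTE_PENALTY - 1.0
    # Contention-averse threads weight cache pressure more heavily.
    contention_weight = 0.5 if hint.prefers_cache_sharing else 2.0
    return remote_cost + contention_weight * stats.llc_miss_rate

def pick_socket(hint, sockets):
    return min(range(len(sockets)), key=lambda i: placement_score(hint, sockets[i]))

sockets = [
    SocketStats(llc_miss_rate=0.50, is_local_to_memory=True),   # local but busy
    SocketStats(llc_miss_rate=0.05, is_local_to_memory=False),  # remote but quiet
]
# A contention-averse thread prefers the quiet remote socket...
print(pick_socket(ThreadHint(prefers_cache_sharing=False), sockets))  # → 1
# ...while a sharing-friendly thread stays local despite the contention.
print(pick_socket(ThreadHint(prefers_cache_sharing=True), sockets))   # → 0
```

The point of the sketch is that the same pair of sockets yields opposite placements depending on the thread's declared preference, which is what a hint interface would buy over a one-size-fits-all policy.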
NUMA aside, other system components are also becoming socket-local in order to scale better; network interfaces and I/O connections are two recent examples. The considerations raised by this NUMA study call for similar studies of those components as well.