After years of serious development, CPUs have gained many cores, blocks of cores distributed across multiple chiplets, NUMA systems, etc. But a piece of data still has to pass through not only the L1 cache (if the threads share a core via SMT) but also some atomic/mutex synchronization primitive that is not accelerated by hardware.
I wonder why Intel or IBM didn’t come up with something like:
    movcor 1 MX 5   <---- sends 5 to Messaging register of core 1
    pipe 1 1 1      <---- pushes data=1 to pipe=1 of core=1 and core1 needs to pop it
    bcast 1         <---- broadcasts 1 to all cores' pipe-0
to make core-to-core communication much faster than the existing methods? GPUs support block-wise fast synchronization points, like barrier() or __syncthreads(). GPUs also support parallel atomic update acceleration for local arrays.
When CPUs gain 256 cores, won’t this feature enable serious scaling for various algorithms that are bottlenecked on core-to-core bandwidth (and/or latency)?
CPUs evolved for a very different programming model than GPUs, to run multiple separate threads, potentially of different processes, so you’d also need software and OS infrastructure to let threads know which other core (if any) some other thread was running on. Or they’d have to pin each thread to a specific core. But even then it would need some way to virtualize the architectural message-passing register, the same way context switches virtualize the standard registers for multi-tasking on each core.
So there’s an extra hurdle before anything like this could even be usable at all under a normal OS, where a single process doesn’t take full ownership of the physical cores. The OS is still potentially scheduling other threads of other processes onto cores, and running interrupt handlers, unlike a GPU where cores don’t have anything else to do and are all built to work together on a massively parallel problem.
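As a rough illustration of the pinning option, here is a minimal Linux-only sketch using the glibc `sched_getaffinity` / `sched_setaffinity` calls. The helper name `pin_to_first_allowed_cpu` is made up for this example. Note that pinning only restricts where this thread may run; it does not stop the OS from scheduling other threads or interrupt handlers on the same core, which is exactly the hurdle described above.

```cpp
#include <sched.h>   // sched_getaffinity, sched_setaffinity (Linux/glibc specific)
#include <cassert>

// Hypothetical helper: pin the calling thread to the lowest-numbered CPU
// it is currently allowed to run on.
// Returns that CPU number, or -1 if either system call fails.
int pin_to_first_allowed_cpu() {
    cpu_set_t allowed;
    CPU_ZERO(&allowed);
    // pid 0 means "the calling thread".
    if (sched_getaffinity(0, sizeof(allowed), &allowed) != 0) return -1;

    for (int cpu = 0; cpu < CPU_SETSIZE; ++cpu) {
        if (CPU_ISSET(cpu, &allowed)) {
            cpu_set_t one;
            CPU_ZERO(&one);
            CPU_SET(cpu, &one);
            // Restrict this thread to exactly one logical CPU.
            if (sched_setaffinity(0, sizeof(one), &one) != 0) return -1;
            return cpu;
        }
    }
    return -1;  // empty affinity mask; should not happen in practice
}
```

A thread would call this once at startup. Two threads that need to talk to each other would still have to learn each other's core numbers through some software channel.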
A task that wants something like this is usually best done on a GPU anyway, not on a few separate deeply pipelined OoO exec CPU cores that are trying to do speculative execution, unlike GPUs, which are simple in-order pipelines.
You couldn’t actually push a result to another core until the instruction producing it retires on the core executing it, because you don’t want to have to roll back the other core as well if you discover a mis-speculation, such as a branch mispredict earlier in the path of execution leading to this. That could conceivably still allow for something lower-latency than bouncing a cache line between cores for shared memory, but it’s a pretty narrow class of application that can use it.
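For comparison, this is roughly what "bouncing a cache line between cores" looks like in software today: a minimal single-producer, single-consumer sketch using C++11 atomics. The `Mailbox` type, the `send` / `receive` / `sum_via_mailbox` names, and the 64-byte line size are assumptions made for this example, not anything from a real library. Every hand-off moves the line holding `payload` and `full` from one core's cache to the other's and back.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <thread>

// Hypothetical one-slot mailbox, aligned to its own cache line
// (64 bytes is an assumption; it is typical for x86).
struct alignas(64) Mailbox {
    std::uint64_t payload = 0;
    std::atomic<bool> full{false};
};

// Producer side: wait for an empty slot, write the data, then publish it.
void send(Mailbox& m, std::uint64_t value) {
    while (m.full.load(std::memory_order_acquire)) std::this_thread::yield();
    m.payload = value;
    m.full.store(true, std::memory_order_release);  // makes payload visible
}

// Consumer side: wait for data, read it, then mark the slot empty again.
std::uint64_t receive(Mailbox& m) {
    while (!m.full.load(std::memory_order_acquire)) std::this_thread::yield();
    std::uint64_t v = m.payload;
    m.full.store(false, std::memory_order_release);
    return v;
}

// Sends 1..n through the mailbox to a second thread and returns the sum
// that the receiving thread computed.
std::uint64_t sum_via_mailbox(std::uint64_t n) {
    Mailbox box;
    std::uint64_t sum = 0;
    std::thread consumer([&] {
        for (std::uint64_t i = 0; i < n; ++i) sum += receive(box);
    });
    for (std::uint64_t v = 1; v <= n; ++v) send(box, v);
    consumer.join();  // join orders the consumer's writes before our read
    return sum;
}
```

A hypothetical hardware message register would replace the `payload` / `full` pair, but the sender could still only publish once the producing instruction is non-speculative, as noted above.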
However, high-performance computing is a known use-case for modern CPUs, so if it were really a game-changer, it would perhaps be worth considering as a design choice.
BTW, for OS use, there is of course an IPI (Inter-Processor Interrupt). But that triggers an interrupt, so it's very low-performance except as a way to avoid polling by the other side. It's also what lets the OS wake up a core from a power-saving sleep state: if more threads are now ready to run, that core should wake up and call schedule() to figure out which one to run.
Any core can send an IPI to any other, if it’s running in kernel mode.
> scaling for various algorithms that are bottlenecked on core-to-core bandwidth (and/or latency)?
Mesh interconnects allow pretty large aggregate bandwidth between cores. There isn’t a single shared bus they all have to compete for. Even the ring bus Intel used before Skylake-Xeon, and still uses in client chips, is pipelined and has pretty decent aggregate bandwidth.
Data can be moving between every pair of cores at the same time. (I mean, 128 pairs of cores can each have data in flight in both directions. With some memory-level parallelism, a pipelined interconnect can have multiple cache lines in flight requested by each core.)