May I have Project Loom Clarified?

April 18, 2022

Brian Goetz got me excited about Project Loom, and to fully appreciate it I need some clarification on the status quo.

My understanding is as follows: currently, to get real parallelism we need one thread per CPU core. 1) Is there then any point in having n+1 threads on an n-core machine? Project Loom will give us virtually limitless threads (fibres), because the JVM itself will run tasks on virtual threads. 2) Will that be truly parallel? 3) How, specifically, will that differ from the scenario above, "n+1 threads on an n-core machine"?

Thanks for your time.

Solution:

Is there then any point in having n+1 threads on an n-core machine?

For one, many modern CPUs support simultaneous multithreading (for example Intel Hyper-Threading), so an n-core machine often exposes 2n hardware threads, two per core.

Sometimes it does make sense to spawn more OS threads than there are hardware threads: namely, when some of those OS threads are asleep, waiting for something such as I/O. For instance, on Linux, until io_uring arrived a couple of years ago (Linux 5.1, 2019), there was no good way to implement asynchronous I/O for files on local disks. Traditionally, disk-heavy applications spawned more threads than CPU cores and used blocking I/O.
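To make the oversubscription point concrete, here is a minimal Java sketch. The class and method names are hypothetical, and `Thread.sleep` stands in for a blocking disk read. It runs the same batch of blocking jobs twice: on a pool with one OS thread per processor reported by `availableProcessors()`, and on a pool eight times larger. Because sleeping threads don't occupy a core, the larger pool should finish in roughly an eighth of the time; exact timings depend on the machine.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class BlockingPool {
    // Runs `tasks` jobs on a fixed pool of `poolSize` OS threads.
    // Each job blocks for `blockMillis` (Thread.sleep stands in for a blocking disk read).
    // Returns the number of jobs that completed.
    static int runBlockingTasks(int poolSize, int tasks, long blockMillis) {
        ExecutorService pool = Executors.newFixedThreadPool(poolSize);
        try {
            List<Future<Integer>> futures = new ArrayList<>();
            for (int i = 0; i < tasks; i++) {
                futures.add(pool.submit(() -> {
                    Thread.sleep(blockMillis); // asleep: this OS thread is not using a core
                    return 1;
                }));
            }
            int done = 0;
            for (Future<Integer> f : futures) {
                done += f.get(); // wait for each job to finish
            }
            return done;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        int cpus = Runtime.getRuntime().availableProcessors();
        int tasks = cpus * 8;
        // Same workload: first one thread per processor, then 8x oversubscribed.
        for (int poolSize : new int[] {cpus, cpus * 8}) {
            long start = System.nanoTime();
            int done = runBlockingTasks(poolSize, tasks, 100);
            long ms = (System.nanoTime() - start) / 1_000_000;
            System.out.printf("pool=%d tasks=%d done=%d elapsed=%d ms%n", poolSize, tasks, done, ms);
        }
    }
}
```

For CPU-bound jobs the larger pool would not help, since there would be nothing to overlap; the gain comes only from threads that are waiting.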

Will that be truly parallel?

It depends on the implementation: not just the language runtime, but also the I/O-related parts of the standard library. For instance, on Windows, when doing disk or network I/O in C# with async/await (a rough counterpart to Project Loom, introduced with C# 5.0 in 2012), these tasks are truly parallel: the OS kernel and drivers really are doing more work at the same time. AFAIK, on Linux async/await is only truly parallel for sockets, not files; for asynchronous file I/O it uses a pool of OS threads under the hood.
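For Loom specifically, JEP 444 (final in JDK 21) describes virtual threads as being scheduled onto a pool of platform "carrier" threads, by default a `ForkJoinPool` whose parallelism equals the number of available processors. CPU-bound work is therefore parallel up to the number of carriers, while a blocking call such as `Thread.sleep` or a socket read unmounts the virtual thread and frees its carrier for other work. For blocking file I/O, JEP 444 says the JDK compensates by temporarily expanding the scheduler's parallelism, much like the thread-pool approach described above. The following sketch, with hypothetical names and assuming JDK 21 or later, runs 10,000 sleeping tasks on virtual threads; they should finish in about the time of a single sleep rather than 10,000 of them.

```java
import java.time.Duration;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicInteger;

public class VirtualThreadsDemo {
    // Runs `tasks` blocking jobs, one virtual thread each, and returns how many completed.
    // Assumes JDK 21+ (JDK 19/20 need --enable-preview).
    static int runOnVirtualThreads(int tasks, Duration block) {
        AtomicInteger completed = new AtomicInteger();
        // close() at the end of try-with-resources waits for every submitted task.
        try (ExecutorService executor = Executors.newVirtualThreadPerTaskExecutor()) {
            for (int i = 0; i < tasks; i++) {
                executor.submit(() -> {
                    // Sleeping parks the virtual thread; its carrier is free to run others.
                    Thread.sleep(block);
                    completed.incrementAndGet();
                    return null;
                });
            }
        }
        return completed.get();
    }

    public static void main(String[] args) {
        long start = System.nanoTime();
        int done = runOnVirtualThreads(10_000, Duration.ofSeconds(1));
        long ms = (System.nanoTime() - start) / 1_000_000;
        System.out.println(done + " tasks finished in " + ms + " ms");
    }
}
```

Only a handful of OS threads (the carriers) exist during the run; the 10,000 concurrent sleeps are bookkeeping inside the JVM, not 10,000 kernel threads.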

How, specifically, will that differ from the aforementioned scenario "n+1 threads on an n-core machine"?

OS threads are more expensive for a few reasons. (1) Each OS thread needs a native stack, so it consumes memory. (2) Memory is slow, and processors have caches to compensate; switching between OS threads increases RAM bandwidth usage, because the cached data of the previous thread is of little use to the next one and gets evicted after a context switch. (3) OS schedulers have improved over decades, but they are still not free; for one thing, saving and restoring thread state to and from memory takes time.

The higher-level cooperative multitasking implemented by C# async/await or Java's Loom has far less context-switch overhead than switching OS threads. At least in theory, this should improve both throughput and latency for I/O-heavy applications.
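A crude way to observe the difference is to run the same blocking code on both kinds of threads through the `Thread.Builder` API (`Thread.ofPlatform()` and `Thread.ofVirtual()`, JDK 21+). This hypothetical sketch is not a rigorous benchmark (a harness such as JMH would be needed for that), and its timings include thread creation as well as context switching, so treat the printed numbers as indicative only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

public class SwitchCost {
    // Starts `count` threads from `builder`; each blocks `blocksPerThread` times,
    // so every thread is descheduled and rescheduled repeatedly.
    // Returns how many threads finished all their blocking calls.
    static int runThreads(Thread.Builder builder, int count, int blocksPerThread) {
        AtomicInteger finished = new AtomicInteger();
        List<Thread> threads = new ArrayList<>(count);
        for (int i = 0; i < count; i++) {
            threads.add(builder.start(() -> {
                try {
                    for (int j = 0; j < blocksPerThread; j++) {
                        // Platform thread: the OS switches it out.
                        // Virtual thread: it unmounts from its carrier inside the JVM.
                        Thread.sleep(1);
                    }
                    finished.incrementAndGet();
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }));
        }
        for (Thread t : threads) {
            try {
                t.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return finished.get();
    }

    // Wall-clock time for one runThreads call, in milliseconds.
    static long timeMillis(Thread.Builder builder, int count, int blocksPerThread) {
        long start = System.nanoTime();
        runThreads(builder, count, blocksPerThread);
        return (System.nanoTime() - start) / 1_000_000;
    }

    public static void main(String[] args) {
        int count = 2_000;
        int blocks = 10;
        System.out.println("platform: " + timeMillis(Thread.ofPlatform(), count, blocks) + " ms");
        System.out.println("virtual:  " + timeMillis(Thread.ofVirtual(), count, blocks) + " ms");
    }
}
```

The thread body is identical in both runs; only the builder changes, which reflects Loom's design goal of keeping the ordinary blocking, thread-per-task style while making each thread cheap.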