14.3 The Go Runtime Scheduler: GOMAXPROCS and Work Stealing

Right, let’s talk about the unsung hero that makes your goroutines actually run without setting your CPU on fire: the Go runtime scheduler. You fire off a million go statements and just expect it to work, and miraculously, it mostly does. This isn’t magic; it’s a carefully engineered piece of software that deserves a moment of your attention.

Think of it this way: your OS scheduler juggles heavyweight threads, which is like trying to manage a construction crew. Context switching is expensive: every switch goes through the kernel and has to save and restore a full set of CPU state, and every thread reserves a sizeable stack up front. Now imagine you need to manage a million tiny, independent tasks. Hiring a million OS threads for that is a recipe for your kernel having a panic attack. Go’s solution is to have its own user-space scheduler that multiplexes your potentially millions of goroutines onto a small number of OS threads. It’s the difference between managing that construction crew and managing an army of highly efficient ants. The OS sees a few threads; the Go runtime sees your entire universe of concurrent work.

The M:N Scheduling Model

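This is the core concept. Go uses an M:N scheduler: many goroutines (G) are multiplexed onto a smaller number of OS threads (M, for ‘machine’), with a fixed set of logical processors (P, for ‘processor’) sitting in between. (Confusingly, the M in “M:N” just means “many”; it is not the M that stands for machine.) It’s a three-layer cake of indirection.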

G (Goroutine): Your lightweight thread. It’s just a function with a stack (that grows and shrinks as needed, which is already genius).
M (Machine): An OS thread. This is the actual thing that gets scheduled by the operating system onto a CPU core. Ms are managed by the Go runtime.
P (Logical Processor): The crucial middleman. A P represents the resources needed to execute Go code, like a local queue of runnable goroutines. The number of Ps equals GOMAXPROCS, which is set when the program starts and can be changed later.

The magic is in the binding: at any given time, an M must be attached to a P to execute Go code. The P’s local queue is where it grabs work from first. This structure means we avoid funneling every scheduling decision through a single, global mutex-protected run queue, which would be a massive bottleneck with thousands of goroutines. A global queue does exist, but it is a fallback rather than the main path. Instead, we have mostly independent local queues, which is where the real performance wins come from.

GOMAXPROCS: Tuning Your Concurrency Engine

GOMAXPROCS is the knob that controls the number of Ps, and therefore the maximum number of OS threads that can simultaneously execute Go code. I say “simultaneously” because that’s the key: threads blocked in system calls don’t count against the limit, so the total number of threads in the process can be higher. By default, GOMAXPROCS is set to the number of logical CPUs available to the process, which on most machines is what runtime.NumCPU() reports. This makes perfect sense: you can’t actually run more goroutines at the exact same moment than you have cores. Concurrency is not parallelism, remember? This setting ensures your program can utilize all available CPU cores for parallel execution.

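You can change it at runtime with runtime.GOMAXPROCS(n), which returns the previous setting; passing a value below 1, such as 0, only queries the current value without changing it. Let’s see it in action.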

package main

import (
	"fmt"
	"runtime"
	"sync"
	"time"
)

func main() {
	// Let's see the default
	fmt.Printf("Default GOMAXPROCS: %d\n", runtime.GOMAXPROCS(0))
	fmt.Printf("NumCPU: %d\n", runtime.NumCPU())

	// Let's run a CPU-bound task with different settings
	testWithGOMAXPROCS(1)
	testWithGOMAXPROCS(4)
}

func testWithGOMAXPROCS(n int) {
	runtime.GOMAXPROCS(n)
	start := time.Now()

	var wg sync.WaitGroup
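	for i := 0; i < 10; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done() // always signal completion, even on an early return
			// Sleeping parks the goroutine; it does not occupy a core.
			time.Sleep(10 * time.Millisecond)
			// A mildly CPU-intensive calculation; this is the part that
			// benefits from running on several cores at once.
			var count int
			for j := 0; j < 1e7; j++ {
				count += j % 2
			}
		}()
	}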
	wg.Wait()

	fmt.Printf("With GOMAXPROCS=%d: %v\n", n, time.Since(start))
}

Running this on a machine with, say, 8 cores, you should see a clear difference, though the exact timings depend on your hardware. With GOMAXPROCS=1, only one goroutine executes Go code at any instant, so the ten goroutines run concurrently but not in parallel: they take turns. With GOMAXPROCS=4, up to four of them run at once, likely on four separate cores, and the batch finishes much faster. The takeaway? You almost never need to change this from its default. The main exception is a process whose CPU allowance is smaller than the machine it runs on, such as a container with a CPU quota: Go releases before 1.25 size GOMAXPROCS from the host’s CPU count rather than the quota, so setting it explicitly (in code or through the GOMAXPROCS environment variable) can help. Even then, it’s a tuning exercise to measure, not a first resort.

Work Stealing: Load Balancing Like a Boss

Here’s where the scheduler gets really clever. What happens if one P’s local run queue is empty, but another P has a backlog of goroutines? This is where the “work stealing” algorithm kicks in. An idle P doesn’t just twiddle its thumbs; it becomes a thief.

After finding nothing in its own queue, the global queue, or the network poller, it visits the other Ps in random order and steals half of the goroutines from a victim’s local run queue. This is a distributed load-balancing mechanism that requires almost no central coordination. It keeps all CPUs busy as long as there is work to be done anywhere in the system. It’s the reason your program efficiently utilizes all the cores you give it without you having to manually partition your workload.
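The steal-half idea can be sketched with a toy model. This is not the runtime’s implementation: the real scheduler uses fixed-size lock-free ring buffers and visits victims in random order, whereas the sketch below uses plain slices, a fixed visiting order, and no synchronization at all. The proc type and both function names are invented for illustration.

```go
package main

import "fmt"

// proc is a toy stand-in for a P: an ID plus a local queue of task IDs.
type proc struct {
	id   int
	runq []int
}

// stealHalf moves half of the victim's queued tasks (rounded up) from the
// front of its queue to the thief, and reports how many were moved.
func stealHalf(thief, victim *proc) int {
	n := len(victim.runq) - len(victim.runq)/2
	if n == 0 {
		return 0
	}
	thief.runq = append(thief.runq, victim.runq[:n]...)
	victim.runq = victim.runq[n:]
	return n
}

// findWork is what an idle proc does in this model: try each other proc in
// turn and stop at the first successful steal.
func findWork(idle *proc, all []*proc) bool {
	for _, p := range all {
		if p != idle && stealHalf(idle, p) > 0 {
			return true
		}
	}
	return false
}

func main() {
	busy := &proc{id: 0, runq: []int{1, 2, 3, 4, 5, 6}}
	idle := &proc{id: 1}
	findWork(idle, []*proc{busy, idle})
	fmt.Println("busy:", busy.runq) // busy: [4 5 6]
	fmt.Println("idle:", idle.runq) // idle: [1 2 3]
}
```

Taking half, rather than one task or everything, is the design choice worth noticing: it leaves both Ps with work and makes another steal unlikely in the near future.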

The Blocking Pitfall: Syscalls and the Detached M

This is the most important edge case to understand. What happens when a goroutine makes a blocking system call, like reading a large file from disk?

The goroutine (G) blocks. But the OS thread (M) it’s running on also blocks—the kernel has to wait for that I/O operation. If the Go runtime just let that happen, eventually all your Ms could end up blocked, and your program would grind to a halt, even if you had plenty of other goroutines that could run.

Go’s solution is both pragmatic and slightly brutal. When a goroutine blocks in a syscall, the runtime detaches the entire M (thread) from its P (logical processor). The blocked G stays attached to the now-sleeping M, which is handled by the kernel. The freed P attaches to another M (an idle one, or a newly created one) so it can keep pulling goroutines from its queue and running them. This is why one slow syscall doesn’t stall everything else queued on that P. The cost is that every goroutine stuck in a blocking syscall holds a whole OS thread for the duration.

When the syscall finally completes, the goroutine needs a P before it can run Go code again. Its M first tries to reacquire the P it had before, and failing that, any idle P. If none is available, the goroutine is placed on the global run queue and the M goes to sleep. Eventually, some P will pick the goroutine up from there and resume its execution.

The moral of the story? Blocking syscalls are still relatively expensive because of this M detachment/reattachment dance. This is a big part of the reason the net package uses non-blocking I/O and the runtime’s network poller under the hood: a goroutine waiting on a socket is simply parked, with no OS thread held, which is what lets thousands of goroutines wait on network requests without needing thousands of threads. It’s also why you should be cautious about using cgo or other code that makes long-running, blocking calls: each call in flight ties up an OS thread, and a burst of them can push the runtime into creating a lot of threads. The scheduler is brilliant, but it can’t work miracles around the fundamental constraints of the operating system. It just does a shockingly good job of hiding them from you.
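One common way to act on that caution is to cap how many blocking calls can be in flight at once, so a burst of goroutines cannot translate into a burst of OS threads. The sketch below uses a buffered channel as a counting semaphore. The names runLimited and blockingCall are hypothetical, and blockingCall merely sleeps as a placeholder; a real cgo call or blocking syscall would hold a thread, while time.Sleep does not.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
	"time"
)

// runLimited calls fn once per task index with at most limit calls in flight
// at a time, and returns the highest number of calls seen running together.
func runLimited(tasks, limit int, fn func(i int)) int32 {
	sem := make(chan struct{}, limit) // buffered channel as counting semaphore
	var inFlight, peak int32
	var wg sync.WaitGroup
	for i := 0; i < tasks; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			sem <- struct{}{}        // acquire a slot; waiting here parks only the goroutine
			defer func() { <-sem }() // release the slot when the call is finished
			cur := atomic.AddInt32(&inFlight, 1)
			for { // record the running maximum of inFlight
				old := atomic.LoadInt32(&peak)
				if cur <= old || atomic.CompareAndSwapInt32(&peak, old, cur) {
					break
				}
			}
			fn(i)
			atomic.AddInt32(&inFlight, -1)
		}(i)
	}
	wg.Wait()
	return atomic.LoadInt32(&peak)
}

// blockingCall is a hypothetical stand-in for a cgo call or a long blocking
// syscall. It only sleeps, which does not actually hold an OS thread.
func blockingCall(i int) {
	time.Sleep(time.Millisecond)
}

func main() {
	peak := runLimited(100, 4, blockingCall)
	fmt.Println("peak concurrent calls:", peak)
}
```

Goroutines waiting for a slot are parked on the channel and cost no threads; only the ones inside fn can be holding one. The right limit depends on the workload and is best found by measuring.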