Goroutine Leaks in Go and How to Prevent Them

Learn how goroutine leaks happen in Go services and how to prevent them with context cancellation, buffered channels, and clear goroutine ownership.

A goroutine leak happens when a goroutine is started but never gets a realistic path to finish.

That sounds small because goroutines are cheap. But cheap is not free. A leaked goroutine keeps its stack, references to heap objects, timers, network calls, channel waits, and sometimes file descriptors or database connections alive for longer than intended. One leaked goroutine is usually invisible. One leaked goroutine per request is a production incident waiting patiently.

This article starts with a realistic bug, then builds toward the production-grade habits that prevent it.

The mental model

A goroutine should have an owner.

The owner is the code that starts it, and the owner must know:

  • when the goroutine should stop
  • how it will be told to stop
  • how the caller waits for it, ignores it safely, or lets it run as a deliberate background worker

If you cannot answer those three questions, the goroutine is probably under-designed.
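One way to make the three answers concrete is to have the code that starts a goroutine return a stop function. The sketch below is illustrative only: startOwned and demoOwned are hypothetical names, and it assumes the work itself watches the context it is given.

```go
package main

import (
	"context"
	"fmt"
)

// startOwned runs work in a new goroutine and returns a stop function.
// Whoever holds stop is the owner: calling it tells the goroutine to stop
// (cancel) and then waits until it has actually returned (<-done).
func startOwned(parent context.Context, work func(ctx context.Context)) (stop func()) {
	ctx, cancel := context.WithCancel(parent)
	done := make(chan struct{})

	go func() {
		defer close(done) // how the owner learns the goroutine finished
		work(ctx)
	}()

	return func() {
		cancel() // how the goroutine is told to stop
		<-done   // how the owner waits for it
	}
}

// demoOwned reports whether the goroutine had exited by the time stop returned.
func demoOwned() bool {
	exited := false
	stop := startOwned(context.Background(), func(ctx context.Context) {
		<-ctx.Done() // stand-in for real work that observes cancellation
		exited = true
	})
	stop()
	return exited
}

func main() {
	fmt.Println("exited before stop returned:", demoOwned())
}
```

If the work function ignores its context, stop blocks forever, which at least makes the missing lifecycle visible at the call site instead of hiding it.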

[Figure: goroutine leak caused by a timed-out request leaving a sender blocked]

A realistic leak

Imagine a checkout service. For each request, it calls a risk service to decide whether the order should be accepted.

The handler wants to return quickly if the client disconnects or the request times out, so it starts the risk check in a goroutine and waits on either the risk result or the request context.

This version looks reasonable at first glance:

package checkout

import (
	"context"
	"encoding/json"
	"net/http"
)

type RiskClient interface {
	Check(ctx context.Context, order Order) (RiskResult, error)
}

type Server struct {
	risk RiskClient
}

type Order struct {
	ID     string
	UserID string
	Total  int64
}

type RiskResult struct {
	Approved bool
	Reason   string
}

type riskResponse struct {
	result RiskResult
	err    error
}

func (s *Server) Checkout(w http.ResponseWriter, r *http.Request) {
	ctx := r.Context()
	order := parseOrder(r)

	riskCh := make(chan riskResponse)

	go func() {
		result, err := s.risk.Check(context.Background(), order)
		riskCh <- riskResponse{result: result, err: err}
	}()

	select {
	case res := <-riskCh:
		if res.err != nil {
			http.Error(w, "risk check failed", http.StatusBadGateway)
			return
		}
		if !res.result.Approved {
			http.Error(w, res.result.Reason, http.StatusForbidden)
			return
		}
		json.NewEncoder(w).Encode(map[string]string{"status": "accepted"})

	case <-ctx.Done():
		http.Error(w, "request cancelled", http.StatusGatewayTimeout)
	}
}

The leak is here:

riskCh := make(chan riskResponse)

go func() {
	result, err := s.risk.Check(context.Background(), order)
	riskCh <- riskResponse{result: result, err: err}
}()

There are two problems.

First, the goroutine uses context.Background(), so it ignores the request cancellation. If the client goes away, the risk check keeps running.

Second, riskCh is unbuffered. A send on an unbuffered channel waits until another goroutine receives. If the handler returns through case <-ctx.Done(), there is no receiver left. When the risk check eventually finishes, the child goroutine blocks forever on:

riskCh <- riskResponse{result: result, err: err}

That goroutine is now leaked.

Why this hurts in production

The bug does not look dramatic in a local test because it only leaks on the timeout path. In production, timeout paths are not rare:

  • clients disconnect
  • load balancers cancel slow requests
  • upstream services become slow during deploys
  • mobile networks drop connections
  • retrying clients create bursts of abandoned work

If each abandoned checkout leaks one goroutine, the service may run fine for hours and then slowly become unhealthy. Memory rises. The scheduler has more goroutines to track. Profiles show thousands of goroutines parked at the same channel send. The original cause may be hidden behind the later symptom: high memory, slow garbage collection, or container restarts.

That is why goroutine leaks are often found with runtime evidence, not by reading the final crash message.

For request-scoped goroutines, use three rules:

  1. Pass the caller’s context.Context into the goroutine’s work.
  2. Give the goroutine a way to finish even if the caller stops waiting.
  3. Bound the number of goroutines if the operation can be triggered many times.

Here is a safer version of the checkout handler:

func (s *Server) Checkout(w http.ResponseWriter, r *http.Request) {
	ctx := r.Context()
	order := parseOrder(r)

	riskCh := make(chan riskResponse, 1)

	go func() {
		result, err := s.risk.Check(ctx, order)

		select {
		case riskCh <- riskResponse{result: result, err: err}:
		case <-ctx.Done():
		}
	}()

	select {
	case res := <-riskCh:
		if res.err != nil {
			http.Error(w, "risk check failed", http.StatusBadGateway)
			return
		}
		if !res.result.Approved {
			http.Error(w, res.result.Reason, http.StatusForbidden)
			return
		}
		json.NewEncoder(w).Encode(map[string]string{"status": "accepted"})

	case <-ctx.Done():
		http.Error(w, "request cancelled", http.StatusGatewayTimeout)
	}
}

The changes are small but important.

riskCh := make(chan riskResponse, 1)

The channel has capacity for one result. If the handler times out right before the child sends, the send can still complete and the goroutine can exit. This is useful when there is exactly one child result and the parent may stop waiting.

result, err := s.risk.Check(ctx, order)

The downstream call receives the request context. This only works if RiskClient.Check respects context cancellation, but that is the contract you want. The standard library’s HTTP client and database/sql package, gRPC clients, and most widely used Go libraries expose context-aware APIs.

select {
case riskCh <- riskResponse{result: result, err: err}:
case <-ctx.Done():
}

The child goroutine does not insist on sending after the request is gone. It either reports the result or exits when cancellation wins.

For this exact shape, the buffered channel alone is often enough. I still like the select because it documents the ownership rule: this goroutine is request-scoped and should not outlive the request once it can observe cancellation.
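To see what the buffer alone buys you, here is a small experiment, written as a sketch rather than a benchmark. abandonedSenders is a hypothetical helper: it simulates callers that give up before the child has a result, then counts how many children are still stuck after a short grace period. The 200ms grace period is an arbitrary assumption.

```go
package main

import (
	"context"
	"fmt"
	"sync/atomic"
	"time"
)

// abandonedSenders simulates n requests whose caller stops waiting before
// the child goroutine has a result. It returns how many children are still
// stuck after a grace period. buffer is the result channel's capacity.
func abandonedSenders(n, buffer int) int {
	var exited int64
	release := make(chan struct{}) // closed once every caller has given up

	ctx, cancel := context.WithCancel(context.Background())
	cancel() // every simulated request is already cancelled

	for i := 0; i < n; i++ {
		resultCh := make(chan int, buffer)
		go func() {
			defer atomic.AddInt64(&exited, 1)
			<-release     // stand-in for a slow downstream call
			resultCh <- 1 // blocks forever if unbuffered and nobody receives
		}()

		select {
		case <-resultCh:
		case <-ctx.Done(): // always taken: the child is still waiting on release
		}
	}
	close(release)

	// Give the children up to 200ms to complete their send and exit.
	deadline := time.Now().Add(200 * time.Millisecond)
	for atomic.LoadInt64(&exited) < int64(n) && time.Now().Before(deadline) {
		time.Sleep(time.Millisecond)
	}
	return n - int(atomic.LoadInt64(&exited))
}

func main() {
	fmt.Println("unbuffered, still stuck:", abandonedSenders(100, 0))
	fmt.Println("buffered(1), still stuck:", abandonedSenders(100, 1))
}
```

With capacity zero every child stays parked on its send; with capacity one each send completes without a receiver and the child exits. The stuck goroutines from the unbuffered run remain in the process, which is exactly the leak being demonstrated.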

An even better option: avoid the goroutine

Before adding any goroutine, ask whether you need one.

If checkout cannot continue without the risk result, this is simpler:

func (s *Server) Checkout(w http.ResponseWriter, r *http.Request) {
	ctx := r.Context()
	order := parseOrder(r)

	result, err := s.risk.Check(ctx, order)
	if err != nil {
		http.Error(w, "risk check failed", http.StatusBadGateway)
		return
	}
	if !result.Approved {
		http.Error(w, result.Reason, http.StatusForbidden)
		return
	}

	json.NewEncoder(w).Encode(map[string]string{"status": "accepted"})
}

This is the best solution when the work is not actually concurrent. The request already runs in its own goroutine inside net/http. Starting a second goroutine just to wait for it immediately is usually accidental complexity.

Use a goroutine when you are doing real overlap:

  • querying risk and inventory at the same time
  • streaming work in the background after the response
  • running a worker loop owned by process lifecycle
  • fan-out to multiple replicas and using the first successful answer

If none of those applies, stay synchronous.
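The last bullet, taking the first successful answer from several replicas, is itself easy to get wrong: the losing goroutines still need somewhere to put their results. A minimal sketch, assuming a hypothetical lookup function type for the replica call (firstSuccess and demoFirstSuccess are also made-up names):

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// lookup is a hypothetical replica call.
type lookup func(ctx context.Context) (string, error)

// firstSuccess queries every replica concurrently and returns the first
// successful answer, or the last error if all of them fail.
func firstSuccess(ctx context.Context, replicas []lookup) (string, error) {
	ctx, cancel := context.WithCancel(ctx)
	defer cancel() // tells the slower replicas to stop once we return

	type answer struct {
		value string
		err   error
	}
	// One slot per replica: every send completes even with no receiver,
	// so no goroutine is stranded when we return early.
	answers := make(chan answer, len(replicas))

	for _, call := range replicas {
		go func(call lookup) {
			value, err := call(ctx)
			answers <- answer{value, err}
		}(call)
	}

	lastErr := errors.New("no replicas configured")
	for range replicas {
		select {
		case a := <-answers:
			if a.err == nil {
				return a.value, nil
			}
			lastErr = a.err
		case <-ctx.Done():
			return "", ctx.Err()
		}
	}
	return "", lastErr
}

// demoFirstSuccess runs one failing, one hanging, and one healthy replica.
func demoFirstSuccess() string {
	failing := func(ctx context.Context) (string, error) {
		return "", errors.New("replica-a unavailable")
	}
	hanging := func(ctx context.Context) (string, error) {
		<-ctx.Done() // never answers; exits only when cancelled
		return "", ctx.Err()
	}
	healthy := func(ctx context.Context) (string, error) {
		return "replica-c", nil
	}
	value, err := firstSuccess(context.Background(), []lookup{failing, hanging, healthy})
	if err != nil {
		return "error: " + err.Error()
	}
	return value
}

func main() {
	fmt.Println(demoFirstSuccess())
}
```

The channel capacity equals the number of replicas, and the deferred cancel releases the hanging call. Both pieces are needed: the buffer lets finished goroutines exit, and the cancellation lets unfinished ones notice they are no longer wanted.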

Handling multiple concurrent calls

Subtler bugs often appear when one request fans out to many goroutines.

Suppose checkout asks three independent systems: risk, inventory, and promotions. This is a reasonable use of concurrency, but you still want cancellation and waiting to be structured.

The errgroup package is a good fit:

import "golang.org/x/sync/errgroup"

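// validateOrder assumes Server also holds inventory and promotions clients
// with context-aware Reserve and Apply methods.
func (s *Server) validateOrder(ctx context.Context, order Order) error {
	// The derived ctx is canceled when any function returns a non-nil error
	// or when Wait returns.
	g, ctx := errgroup.WithContext(ctx)
	g.SetLimit(3) // at most three functions run at once

	g.Go(func() error {
		_, err := s.risk.Check(ctx, order)
		return err
	})

	g.Go(func() error {
		return s.inventory.Reserve(ctx, order)
	})

	g.Go(func() error {
		return s.promotions.Apply(ctx, order)
	})

	// Wait blocks until every function has returned and reports the first error.
	return g.Wait()
}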

errgroup.WithContext gives the group a derived context. If one function returns a non-nil error, the context is canceled and the other functions get a signal to stop. g.Wait() makes ownership explicit: the caller blocks until all started functions have returned, then receives the first error, if any.

SetLimit matters when the fan-out count can grow. If a request can start one goroutine per cart item, one goroutine per tenant, or one goroutine per downstream shard, you should usually put a ceiling on it.
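To show what such a ceiling does without pulling in errgroup, here is a standard-library sketch that uses a buffered channel as a counting semaphore. forEachLimited and peakConcurrency are hypothetical helpers, and the sleep stands in for a downstream call.

```go
package main

import (
	"context"
	"fmt"
	"sync"
	"time"
)

// forEachLimited runs work once per item with at most limit calls in flight.
// It returns only after every started call has returned.
func forEachLimited(ctx context.Context, items []string, limit int, work func(context.Context, string)) {
	sem := make(chan struct{}, limit) // counting semaphore
	var wg sync.WaitGroup

	for _, item := range items {
		select {
		case sem <- struct{}{}: // acquire a slot
		case <-ctx.Done(): // stop waiting for a slot once cancelled
			wg.Wait()
			return
		}
		wg.Add(1)
		go func(item string) {
			defer wg.Done()
			defer func() { <-sem }() // release the slot
			work(ctx, item)
		}(item)
	}
	wg.Wait()
}

// peakConcurrency reports the highest number of simultaneous work calls
// observed while processing the given number of items.
func peakConcurrency(items, limit int) int {
	var mu sync.Mutex
	inFlight, peak := 0, 0

	forEachLimited(context.Background(), make([]string, items), limit,
		func(ctx context.Context, _ string) {
			mu.Lock()
			inFlight++
			if inFlight > peak {
				peak = inFlight
			}
			mu.Unlock()

			time.Sleep(2 * time.Millisecond) // stand-in for a downstream call

			mu.Lock()
			inFlight--
			mu.Unlock()
		})
	return peak
}

func main() {
	fmt.Println("peak in flight:", peakConcurrency(20, 3))
}
```

However many items arrive, the number of live goroutines stays at or below the limit, so a burst of large carts cannot multiply into an unbounded number of downstream calls.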

Common leak patterns

Sending with no receiver

This is the checkout bug:

ch := make(chan Result)

go func() {
	ch <- slowWork()
}()

select {
case result := <-ch:
	use(result)
case <-ctx.Done():
	return ctx.Err()
}

If ctx.Done() wins, the child may block forever trying to send.

Use a buffered channel of size one, a cancellation-aware send, or structured waiting with errgroup.

Receiving from a channel nobody will close

This worker never exits if jobs is never closed and no cancellation path exists:

go func() {
	for job := range jobs {
		process(job)
	}
}()

A long-lived worker is fine, but it needs a lifecycle owner:

go func() {
	for {
		select {
		case job, ok := <-jobs:
			if !ok {
				return
			}
			process(job)
		case <-ctx.Done():
			return
		}
	}
}()

Tickers that are never stopped

This leaks both the goroutine and the ticker, because ticker.C is never closed and the loop has no other way out:

go func() {
	ticker := time.NewTicker(time.Minute)
	for range ticker.C {
		refreshCache()
	}
}()

Prefer:

go func() {
	ticker := time.NewTicker(time.Minute)
	defer ticker.Stop()

	for {
		select {
		case <-ticker.C:
			refreshCache()
		case <-ctx.Done():
			return
		}
	}
}()

Background work with no shutdown

Some goroutines are supposed to live longer than one request: queue consumers, metrics loops, cache refreshers, subscription readers.

That is fine. They still need a process-level owner. In a server, that usually means a root context that is canceled during shutdown and a WaitGroup or errgroup that confirms the workers exited before the process stops.
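A process-level owner can be quite small. The sketch below assumes a hypothetical supervisor type. In a real server the root context would more likely come from signal.NotifyContext in os/signal than from a bare context.WithCancel, and the grace period is an arbitrary choice.

```go
package main

import (
	"context"
	"fmt"
	"sync"
	"time"
)

// supervisor owns every long-lived goroutine in the process.
type supervisor struct {
	ctx    context.Context
	cancel context.CancelFunc
	wg     sync.WaitGroup
}

func newSupervisor() *supervisor {
	ctx, cancel := context.WithCancel(context.Background())
	return &supervisor{ctx: ctx, cancel: cancel}
}

// Go starts a background loop that must return when its context is cancelled.
func (s *supervisor) Go(loop func(ctx context.Context)) {
	s.wg.Add(1)
	go func() {
		defer s.wg.Done()
		loop(s.ctx)
	}()
}

// Shutdown cancels the root context and reports whether every loop exited
// within the grace period.
func (s *supervisor) Shutdown(grace time.Duration) bool {
	s.cancel()

	done := make(chan struct{})
	go func() {
		s.wg.Wait() // stays blocked only if a loop ignores cancellation
		close(done)
	}()

	select {
	case <-done:
		return true
	case <-time.After(grace):
		return false
	}
}

// demoShutdown starts three ticker loops and shuts them down.
func demoShutdown() bool {
	s := newSupervisor()
	for i := 0; i < 3; i++ {
		s.Go(func(ctx context.Context) {
			ticker := time.NewTicker(time.Millisecond)
			defer ticker.Stop()
			for {
				select {
				case <-ticker.C: // stand-in for a cache refresh or queue poll
				case <-ctx.Done():
					return
				}
			}
		})
	}
	return s.Shutdown(time.Second)
}

func main() {
	fmt.Println("all workers exited:", demoShutdown())
}
```

A false return from Shutdown is useful evidence in its own right: it means some worker has no working path to exit, and that is worth logging before the process stops.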

How to find goroutine leaks

The first clue is usually a goroutine count that only moves up.

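You can expose Go’s pprof handlers in internal environments. The blank import below registers them on http.DefaultServeMux, so they are reachable only if the process actually serves that mux: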

import _ "net/http/pprof"

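Then inspect goroutine stacks (this assumes the default mux is served on localhost:6060; the URL is quoted so the shell does not treat ? as a wildcard):

curl 'http://localhost:6060/debug/pprof/goroutine?debug=2'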
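The curl command assumes something is listening on localhost:6060 and serving the default mux, which the blank import alone does not do. A minimal sketch of that missing piece follows. startDebugServer, goroutineDumpStatus, and probeDebugServer are hypothetical helpers, and the sketch binds to port 0 so it can run anywhere; a service would pass a fixed loopback address such as "localhost:6060".

```go
package main

import (
	"fmt"
	"net"
	"net/http"
	_ "net/http/pprof" // side effect: registers /debug/pprof/* on http.DefaultServeMux
)

// startDebugServer serves http.DefaultServeMux, and with it the pprof
// handlers, on addr. It returns the bound address and the server so the
// caller can Close it.
func startDebugServer(addr string) (string, *http.Server, error) {
	ln, err := net.Listen("tcp", addr)
	if err != nil {
		return "", nil, err
	}
	srv := &http.Server{Handler: http.DefaultServeMux}
	go srv.Serve(ln) // returns once srv.Close is called
	return ln.Addr().String(), srv, nil
}

// goroutineDumpStatus requests the full goroutine dump and returns the
// HTTP status code of the response.
func goroutineDumpStatus(addr string) (int, error) {
	resp, err := http.Get("http://" + addr + "/debug/pprof/goroutine?debug=2")
	if err != nil {
		return 0, err
	}
	defer resp.Body.Close()
	return resp.StatusCode, nil
}

// probeDebugServer starts the server on an OS-chosen loopback port, fetches
// the dump once, and returns the status code (0 on any error).
func probeDebugServer() int {
	addr, srv, err := startDebugServer("127.0.0.1:0")
	if err != nil {
		return 0
	}
	defer srv.Close()

	status, err := goroutineDumpStatus(addr)
	if err != nil {
		return 0
	}
	return status
}

func main() {
	fmt.Println("goroutine dump status:", probeDebugServer())
}
```

Binding to a loopback address keeps the profiling endpoints off public interfaces, which matters because the dumps expose stack traces and function arguments.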

What you are looking for is not just “many goroutines”. Some services naturally have many goroutines. The suspicious pattern is many goroutines blocked at the same source line:

goroutine 48291 [chan send]:
checkout.(*Server).Checkout.func1(...)
	/app/checkout/handler.go:42

For regular metrics, track the goroutine count. In modern Go, you can read it through runtime metrics:

samples := []metrics.Sample{{Name: "/sched/goroutines:goroutines"}}
metrics.Read(samples)
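The two lines above fill in the sample but do not show how to get the number out. A complete sketch, with goroutineCount as a hypothetical helper name:

```go
package main

import (
	"fmt"
	"runtime/metrics"
)

// goroutineCount returns the number of live goroutines as reported by the
// runtime/metrics package. ok is false if the metric is not supported by
// the running Go version.
func goroutineCount() (count uint64, ok bool) {
	samples := []metrics.Sample{{Name: "/sched/goroutines:goroutines"}}
	metrics.Read(samples)

	// Read sets the value's kind to KindBad for unknown metric names,
	// so check the kind before calling Uint64.
	if samples[0].Value.Kind() != metrics.KindUint64 {
		return 0, false
	}
	return samples[0].Value.Uint64(), true
}

func main() {
	if n, ok := goroutineCount(); ok {
		fmt.Println("goroutines:", n)
	} else {
		fmt.Println("goroutine metric not supported")
	}
}
```

Exporting this value as a gauge on a fixed interval is usually enough to reveal the upward-only trend that signals a leak.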

In tests, Uber’s goleak package is useful for catching goroutines that remain after a test finishes:

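// imports: "testing" and "go.uber.org/goleak"
func TestCheckoutDoesNotLeakOnTimeout(t *testing.T) {
	// Runs after the test body; fails the test if unexpected goroutines remain.
	defer goleak.VerifyNone(t)

	// run the timeout path here
}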

Leak tests are especially valuable around code that uses channels, timers, retries, streaming APIs, and background workers.

A practical checklist

When reviewing Go code that starts goroutines, ask these questions:

  • Who owns this goroutine?
  • What makes it return?
  • Does it observe context.Context or another shutdown signal?
  • Can it block forever on a channel send or receive?
  • If the caller returns early, what happens to the child goroutine?
  • If this runs once per request, is there a concurrency limit?
  • Is there a test for the timeout, cancellation, and error paths?

The advanced version is the same checklist with less optimism. Every select branch, early return, timeout, retry, and channel close is part of the lifecycle contract.

Final thought

Goroutine leaks are not a Go beginner problem. They happen because Go makes concurrency easy to start, while lifecycle ownership still has to be designed.

The recommended default is simple:

Do not start a goroutine unless you know who owns it, how it stops, and how the rest of the program waits for it or safely stops caring.

For request-scoped work, prefer synchronous code first. When concurrency is useful, pass context, avoid unbounded fan-out, make channel sends safe when the receiver may leave, and use errgroup or a clear worker lifecycle where it fits.
