Adding HTTP Probes to Talos Linux


Historically, isolating network configuration issues when deploying new clusters has been difficult. Talos’s existing TCPProbeConfig checks reachability by attempting direct TCP connections, which fails in environments where egress is routed through an HTTP proxy even though the endpoint is reachable via the proxy.

This post walks through a feature addition: HTTPProbeConfig, which targets HTTP endpoints, succeeds for HTTP responses in the 200–399 range, and honors node proxy settings (HTTP_PROXY, HTTPS_PROXY, NO_PROXY). It also integrates with Talos’s COSI state machine the same way TCP probes do, producing probe resources you can query.
I’ll cover what existed before, why HTTP probes were needed, every layer of code that changed, and how the feature is surfaced and tested.
(See Issue: https://github.com/siderolabs/talos/issues/12952 and PR: https://github.com/siderolabs/talos/pull/13082.)

Proposed Solution:

  • Add an HTTPProbeConfig that accepts an HTTP(S) endpoint as its target and succeeds if the server responds with a status code in the 200–399 range.

  • Respect the node’s proxy configuration (HTTP_PROXY, HTTPS_PROXY, NO_PROXY, etc.), so probes succeed in proxy-gated environments.

  • Integrate with Talos’s COSI state machine the same way TCP probes do, producing probe resources you can query.

Background: what are network probes?

Probes are useful for detecting network misconfiguration, validating service availability, and driving automated remediation workflows. Talos nodes can continuously monitor connectivity to endpoints. You define a probe in the machine config that says, in effect, "check this address every 5 seconds"; the node then tests it periodically and stores the result as a resource you can query.

talosctl --nodes 172.20.0.2 get probestatus
NODE         NAMESPACE   TYPE          ID                    VERSION   SUCCESS
172.20.0.2   network     ProbeStatus   tcp:172.20.0.2:6443   1         true

Before this change, only TCP probes existed. You could check "is this port open?" but you couldn't ask "is the server actually responding to HTTP requests on that port, and is TLS configured correctly?" That's the gap HTTP probes fill.

The Two probe types: what's the actual difference?

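Both probes connect to a server, but they stop at different depths: a TCP probe stops as soon as the connection is accepted, while an HTTP probe continues through the TLS handshake (for https URLs) and waits for an HTTP response.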

This means you can probe the same port and get different answers:

TCP  tcp:172.20.0.2:6443     → SUCCESS  (connection accepted)
HTTP http://172.20.0.2:6443  → SUCCESS  (server responded)
HTTP https://172.20.0.2:6443 → FAILURE  (self-signed cert, not trusted)

All three hit the same wire, same IP, same port. The difference is how deep the check goes.

HTTP probes:

  • Reflect actual application level connectivity (DNS + proxy + HTTP routing).

  • Provide meaningful signals in environments where egress is only possible via an HTTP proxy.

The Code: a 5-layer onion

Talos is structured in clean layers. Each layer has a single job and only talks to adjacent layers through well-defined interfaces. Adding HTTP probes meant touching all five:

Layer 5:  Integration test    internal/integration/api/probe-config.go
Layer 4:  Probe runner        internal/probe/probe.go
Layer 3:  COSI controller     controllers/network/probe_config.go
Layer 2:  Config document     config/types/network/http_probe.go
Layer 1:  Resource / Proto    resources/network/probe_spec.go

But first, it's important to know more about COSI. Skip the following explanation if you're familiar with the Kubernetes resource-controller model.

What is COSI?

COSI (Common Operating System Interface) is Talos's internal state machine. It follows the same pattern as Kubernetes, resources plus controllers, but runs entirely inside a single machined process. Think of COSI as an in-memory, event-driven database.

Resources

Resources are named pieces of data sitting in memory. Every resource has four parts (described here by analogy with a database):

  • Namespace: You can think of this as the database

    • Two resources can have the same Type and ID as long as they're in different namespaces.

      // pkg/machinery/resources/network/network.go
      const NamespaceName       resource.Namespace = "network"          // final merged state
      const ConfigNamespaceName resource.Namespace = "network-config"   // per-source drafts
      
  • Type: You can think of this as the table name

    • This is a string constant. The full DNS-style name is a convention that prevents collisions between different Talos subsystems.

      // pkg/machinery/resources/network/probe_spec.go
      const ProbeSpecType = resource.Type("ProbeSpecs.net.talos.dev")
      
  • ID: The unique key for this specific instance within (namespace, type)

    • For a probe at https://example.com, the ID is "http:https://example.com". This string is derived from the spec itself:

      // probe_spec.go
      func (spec *ProbeSpecSpec) ID() (resource.ID, error) {
          if spec.HTTP != (HTTPProbeSpec{}) {
              return fmt.Sprintf("http:%s", spec.HTTP.URL), nil
          }
          ...
      }
      
  • TypedSpec: The actual Go struct holding the payload

Put together, a probe resource looks like this:

Namespace: network-config
Type:      ProbeSpecs.net.talos.dev
ID:        configuration/tcp:proxy.example.com:3128
Spec:      ProbeSpecSpec{ Interval: 1s, FailureThreshold: 3, TCP: { Endpoint: "proxy.example.com:3128", Timeout: 10s }, ConfigLayer: configuration }

For this change, ProbeSpec is the resource we care about.

Why There Are Two Network Namespaces?

The probe subsystem uses three namespaces:

Namespace constant          | String           | What lives there
config.NamespaceName        | "config"         | The generic namespace that holds the raw MachineConfig document
network.ConfigNamespaceName | "network-config" | Per-source ProbeSpec resources, for example those from machine config
network.NamespaceName       | "network"        | Merged final ProbeSpec resources plus ProbeStatus results

The reason for having both the network-config and network namespaces is layered configuration. Talos Linux supports the idea that multiple sources (machine config, operator, defaults, etc.) could each supply a version of the same resource. Initially, they're written into network-config with a prefix indicating their source. A merge controller then combines them into the final canonical resource in the network namespace.

When ProbeConfigController writes a probe spec, it uses LayeredID:

// network/network.go
func LayeredID(layer ConfigLayer, id string) string {
    return fmt.Sprintf("%s/%s", layer, id)
    // "configuration" + "/" + "http:https://example.com"
    // → "configuration/http:https://example.com"
}

So in network-config the ID is "configuration/http:https://example.com".
After merging, in network it becomes just "http:https://example.com".

This matters because ProbeController reads from network, not network-config, so it always sees the final merged view.
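The ID scheme can be sketched end to end with local stand-in types. This is an illustration rather than the Talos source: the structs keep only the fields needed for the ID, the "tcp:" prefix is inferred from the probestatus output shown earlier, and the error message is an assumption.

```go
package main

import (
	"errors"
	"fmt"
)

// Local stand-ins for the Talos types; only the fields needed for IDs are kept.
type TCPProbeSpec struct{ Endpoint string }

type HTTPProbeSpec struct{ URL string }

type ProbeSpecSpec struct {
	TCP  TCPProbeSpec
	HTTP HTTPProbeSpec
}

// ID derives the resource ID from whichever probe type is populated.
func (spec ProbeSpecSpec) ID() (string, error) {
	switch {
	case spec.HTTP != (HTTPProbeSpec{}):
		return "http:" + spec.HTTP.URL, nil
	case spec.TCP != (TCPProbeSpec{}):
		return "tcp:" + spec.TCP.Endpoint, nil
	default:
		return "", errors.New("no probe type specified") // assumed message
	}
}

// LayeredID prefixes an ID with the config layer that produced it.
func LayeredID(layer, id string) string {
	return fmt.Sprintf("%s/%s", layer, id)
}

func main() {
	spec := ProbeSpecSpec{HTTP: HTTPProbeSpec{URL: "https://example.com"}}

	id, err := spec.ID()
	if err != nil {
		panic(err)
	}

	fmt.Println("network-config:", LayeredID("configuration", id))
	fmt.Println("network:       ", id)
}
```

The layered form only ever appears in network-config; consumers that read the network namespace see the bare ID.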

Controllers

Controllers are goroutines that declare which resources they read (inputs) and which they write (outputs), then react to change events:

Controller: ProbeConfigController
  Input (weak): config.MachineConfigType (ID=active)
  Output (shared): network.ProbeSpecType

Controller: ProbeController
  Input (weak): network.ProbeSpecType (all IDs)
  Output (exclusive): network.ProbeStatusType

Every controller implements three methods plus a run loop:

Name()    string               // unique string, for logging/debugging
Inputs()  []controller.Input   // what I watch
Outputs() []controller.Output  // what I own and write
Run(ctx, runtime, logger)      // the actual loop

Weak input: the controller sees the resource but doesn't own it.
Strong input: the controller holds a finalizer, so the resource can't be deleted until the controller releases it. For config-reading controllers, weak is always correct, since there's no reason to block deletion of the user's config document.

Let's look at ProbeConfigController specifically:

// probe_config.go
func (ctrl *ProbeConfigController) Inputs() []controller.Input {
    return []controller.Input{{
        Namespace: config.NamespaceName,          // "config" namespace
        Type:      config.MachineConfigType,      // "MachineConfigs.config.talos.dev"
        ID:        optional.Some(config.ActiveID), // specifically the "active" instance
        Kind:      controller.InputWeak,
    }}
}

func (ctrl *ProbeConfigController) Outputs() []controller.Output {
    return []controller.Output{{
        Type: network.ProbeSpecType,       // "ProbeSpecs.net.talos.dev"
        Kind: controller.OutputShared,    // other controllers can also write ProbeSpecs
    }}
}

"Shared" output means multiple controllers can write resources of the same output type.
"Exclusive" output means only this controller may write resources of that type.

Why COSI?

Talos runs dozens of concurrent controllers that handle tasks such as managing network links, resolving DNS, applying firewall rules, running probes, and more. As a system like this grows, the most common approach is to have components call each other directly: "controller A needs data from controller B, so A calls B". This creates a cascade of problems:

  • Tight coupling

  • Race conditions

  • Startup ordering issues

  • No automatic retry on failure

COSI solves this by keeping all shared state in a central in-memory store and having every component read from and write to that store only. The COSI runtime re-runs controllers automatically whenever their inputs change, so no controller ever talks to another controller directly.

The Event Loop: How a Controller Actually Runs

The event loop, i.e. the reaction to change events, is the heart of every controller.

Note: the controller always reads the full current state, not a diff.

// probe_config.go
func (ctrl *ProbeConfigController) Run(ctx context.Context, r controller.Runtime, ...) error {
    for {
        select {
        case <-ctx.Done():
            return nil          // machined is shutting down
        case <-r.EventCh():     // ← COSI fires this when any Input changed
        }

        // Now reconcile: read inputs, compute outputs, write outputs
        r.StartTrackingOutputs()
        // ...write ProbeSpecs...
        r.CleanupOutputs(ctx, ...)   // delete ProbeSpecs not written this round
        r.ResetRestartBackoff()
    }
}

The COSI runtime calls Run() once at startup and keeps it alive for the lifetime of machined. Every controller sits in a for loop waiting on r.EventCh(); when any input resource changes, the channel fires and the controller re-runs its logic. The event carries no details: the controller only knows that something changed and that it should reconcile. This is intentional. It keeps the reconcile logic simple (always recompute from scratch) and avoids complex diffing.

StartTrackingOutputs() tells the runtime "start remembering every output I write from now on."
CleanupOutputs() then destroys any output resources that the controller owns but didn't write during this cycle.

Reading and Writing — The safe Package

COSI provides type-safe wrappers so you never deal with raw interface{}:

Reading a single known resource by ID with safe.ReaderGetByID:

cfg, err := safe.ReaderGetByID[*config.MachineConfig](ctx, r, config.ActiveID)
// cfg is *config.MachineConfig, fully typed
// if not found, err is a "not found" error you can check with state.IsNotFoundError(err)

Reading all resources of a type with safe.ReaderListAll:

specList, err := safe.ReaderListAll[*network.ProbeSpec](ctx, r)
for probeSpec := range specList.All() {
    // probeSpec is *network.ProbeSpec
    probeSpec.Metadata().ID()   // the ID string
    probeSpec.TypedSpec()       // *ProbeSpecSpec
}

Writing with safe.WriterModify (upsert: create if missing, update if it exists):

safe.WriterModify(ctx, r,
    network.NewProbeSpec(network.ConfigNamespaceName, "configuration/http:https://example.com"),
    func(r *network.ProbeSpec) error {
        *r.TypedSpec() = spec   // overwrite the entire spec
        return nil
    },
)

WriterModify handles optimistic concurrency internally. It is COSI's compare-and-swap: it reads the existing resource, applies your function, and only persists the result if it differs from what's already stored. This prevents infinite reconcile loops.

We've covered reading and writing resources, but what about resources that were written in a previous cycle and are no longer wanted in the current one?

That's what r.StartTrackingOutputs() and r.CleanupOutputs() are for: together they track everything written in this cycle and delete anything the controller wrote before but didn't rewrite this time.
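The track-then-clean-up idea can be illustrated with a toy in-memory store. Everything here (outputStore and its methods) is hypothetical and deliberately not the COSI API; it only mimics the behaviour described above.

```go
package main

import (
	"fmt"
	"sort"
)

// outputStore is a hypothetical stand-in for the set of resources a
// controller owns. It is not the COSI runtime.
type outputStore struct {
	resources map[string]string // ID -> spec
	touched   map[string]bool   // IDs written in the current cycle
}

func newOutputStore() *outputStore {
	return &outputStore{resources: map[string]string{}, touched: map[string]bool{}}
}

// startTracking begins a new reconcile cycle and forgets earlier writes.
func (s *outputStore) startTracking() { s.touched = map[string]bool{} }

// modify upserts a resource and marks it as written in this cycle.
func (s *outputStore) modify(id, spec string) {
	s.resources[id] = spec
	s.touched[id] = true
}

// cleanup deletes every owned resource that was not written in this cycle.
func (s *outputStore) cleanup() {
	for id := range s.resources {
		if !s.touched[id] {
			delete(s.resources, id)
		}
	}
}

// ids returns the stored IDs in sorted order.
func (s *outputStore) ids() []string {
	out := make([]string, 0, len(s.resources))
	for id := range s.resources {
		out = append(out, id)
	}

	sort.Strings(out)

	return out
}

func main() {
	s := newOutputStore()

	// Cycle 1: the machine config contains two probes.
	s.startTracking()
	s.modify("configuration/tcp:proxy.example.com:3128", "tcp")
	s.modify("configuration/http:https://example.com", "http")
	s.cleanup()
	fmt.Println(s.ids())

	// Cycle 2: the user removed the TCP probe, so only HTTP is rewritten.
	s.startTracking()
	s.modify("configuration/http:https://example.com", "http")
	s.cleanup()
	fmt.Println(s.ids())
}
```

After the second cycle only the HTTP probe remains, without the controller ever issuing an explicit delete for the TCP one.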

Now let's go back to the code and start with Layer 1.

Resource/Proto

Talos Linux uses Protocol Buffers for its COSI resources, but unlike most projects we don't write the .proto file by hand. It is an auto-generated artifact, not the source of truth.
The real source of truth is the Go struct, in this case in probe_spec.go. That's where we update the COSI resource, ProbeSpecSpec (the description of the probe).

Running make generate triggers structprotogen, which scans structs annotated with //gotagsrewrite:gen, reads their protobuf:"N" tags, and generates network.proto. Then protoc generates network.pb.go and network_vtproto.pb.go.

probe_spec.go             ----->  network.proto     ----->  network.pb.go
(//gotagsrewrite:gen,             (generated by             (generated by
 protobuf:"N" tags)                structprotogen)           protoc)

Changes: pkg/machinery/resources/network/probe_spec.go

Before:

type ProbeSpecSpec struct {
    Interval         time.Duration
    FailureThreshold int
    TCP              TCPProbeSpec
    ConfigLayer      ConfigLayer
}

After: we added HTTPProbeSpec and a new HTTP field:

// New struct for HTTP-specific fields
type HTTPProbeSpec struct {
    URL     string        `yaml:"url"     protobuf:"1"`
    Timeout time.Duration `yaml:"timeout" protobuf:"2"`
}

type ProbeSpecSpec struct {
    Interval         time.Duration `yaml:"interval"         protobuf:"1"`
    FailureThreshold int           `yaml:"failureThreshold" protobuf:"2"`
    TCP              TCPProbeSpec  `yaml:"tcp,omitempty"    protobuf:"3"`
    ConfigLayer      ConfigLayer   `yaml:"layer"            protobuf:"4"`
    HTTP             HTTPProbeSpec `yaml:"http,omitempty"   protobuf:"5"` // NEW
}

Only one of TCP or HTTP is ever populated. The zero value of the other is what the runner uses to decide which probe to run; more on this in Layer 4.

Config Layer

What is the config layer?

The config layer is how a Talos user expresses intent. You write a YAML document, apply it to a node, and Talos parses it into typed Go structs. The rest of the system reads these structs to know what to do.
Talos has a multi-document config system: instead of one giant machine.yaml, it supports many separate typed documents.
For example, a user can write the following config:

apiVersion: v1alpha1
kind: TCPProbeConfig
name: proxy-check
endpoint: proxy.example.com:3128
interval: 5s
timeout: 10s

This creates a TCP probe against the endpoint proxy.example.com:3128. After writing the document, a user can apply it or patch it into the machine config.
Once the config is applied, a Talos controller picks it up and writes it as a COSI resource. For probes there are three controllers involved, which we'll get to later.

HTTPProbeConfig

apiVersion: v1alpha1
kind: HTTPProbeConfig
name: my-probe
interval: 5s
failureThreshold: 3
url: https://example.com
timeout: 10s

When a user writes a YAML document of kind HTTPProbeConfig, Talos needs Go structs to hold and validate its contents. Once validated, the document becomes part of the machine config resource in the config namespace. A controller picks it up and writes it as a COSI resource in network-config, and after merging it becomes the resolved resource in the network namespace.

The Interface contract

Before writing any struct, we define the interface that controllers will use to talk to an HTTP probe config document. Controllers in Talos never import concrete config types; they work through interfaces (defined in network.go in this case). This keeps the controller layer fully decoupled from how config is stored or versioned.

The existing TCP probe interfaces are:

type NetworkCommonProbeConfig interface {
    NamedDocument           // Name() string
    Interval() time.Duration
    FailureThreshold() int
}

type NetworkTCPProbeConfig interface {
    NetworkCommonProbeConfig
    Endpoint() string
    Timeout() time.Duration
}

We add a parallel interface for HTTP:

type NetworkHTTPProbeConfig interface {
    NetworkCommonProbeConfig
    URL() string
    Timeout() time.Duration
}

This is the contract our config document struct must satisfy, and it's what the ProbeConfigController will type-switch on when translating config documents into COSI resources.
Now let's write the config document for HTTPProbeConfig.

The Registry : How kind maps to a Go struct

When Talos parses a config document it looks up kind in a global registry. Each kind registers itself at startup via an init() in its own file:

const HTTPProbeKind = "HTTPProbeConfig"

func init() {
    registry.Register(HTTPProbeKind, func(version string) config.Document {
        switch version {
        case "v1alpha1":
            return &HTTPProbeConfigV1Alpha1{}
        default:
            return nil
        }
    })
}

This lets Talos support multiple API versions of the same document kind without large switch statements anywhere else.
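The registry pattern itself is small enough to sketch in isolation. The names below (Document, register, lookup, factories) are hypothetical stand-ins and not the Talos registry package; the point is only how a kind plus a version resolves to a fresh empty struct.

```go
package main

import "fmt"

// Document is a minimal stand-in for config.Document.
type Document interface{ Kind() string }

type httpProbeV1Alpha1 struct{}

func (httpProbeV1Alpha1) Kind() string { return "HTTPProbeConfig" }

// factories maps a kind to a version-aware constructor.
var factories = map[string]func(version string) Document{}

// register records the constructor for a kind.
func register(kind string, f func(version string) Document) { factories[kind] = f }

// lookup returns a new empty document for (kind, version), or an error
// if either the kind or the version is unknown.
func lookup(kind, version string) (Document, error) {
	f, ok := factories[kind]
	if !ok {
		return nil, fmt.Errorf("unknown kind %q", kind)
	}

	doc := f(version)
	if doc == nil {
		return nil, fmt.Errorf("unsupported version %q for kind %q", version, kind)
	}

	return doc, nil
}

// Each kind registers itself from its own file at package init time.
func init() {
	register("HTTPProbeConfig", func(version string) Document {
		switch version {
		case "v1alpha1":
			return &httpProbeV1Alpha1{}
		default:
			return nil
		}
	})
}

func main() {
	for _, version := range []string{"v1alpha1", "v2"} {
		doc, err := lookup("HTTPProbeConfig", version)
		fmt.Println(version, doc != nil, err)
	}
}
```

A decoder can then unmarshal the YAML into whatever lookup returns, without knowing the concrete type.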

CommonProbeConfig - shared embedding with TCP

Before looking at the full HTTP struct it's worth understanding CommonProbeConfig. Rather than duplicating interval and failureThreshold across every probe type, both TCPProbeConfigV1Alpha1 and HTTPProbeConfigV1Alpha1 embed this shared struct:

type CommonProbeConfig struct {
    ProbeInterval         time.Duration `yaml:"interval"`
    ProbeFailureThreshold int           `yaml:"failureThreshold"`
}

func (c *CommonProbeConfig) Interval() time.Duration {
    return c.ProbeInterval
}

func (c *CommonProbeConfig) FailureThreshold() int {
    return c.ProbeFailureThreshold
}

Because the struct is embedded without a field name, all of its methods are promoted to the outer struct. Both HTTPProbeConfigV1Alpha1.Interval() and TCPProbeConfigV1Alpha1.Interval() come from the same place, with no duplication.

The Config document struct

HTTPProbeConfigV1Alpha1 is the Go struct that backs the YAML document:

type HTTPProbeConfigV1Alpha1 struct {
    meta.Meta         `yaml:",inline"` // holds apiVersion + kind
    MetaName          string `yaml:"name"` // name of the probe
    CommonProbeConfig `yaml:",inline"` // embedded: interval, failureThreshold
    HTTPEndpoint      string        `yaml:"url"`               // HTTP/HTTPS URL to probe
    HTTPTimeout       time.Duration `yaml:"timeout,omitempty"` // timeout for the probe
}

A constructor sets the kind and apiVersion fields so callers don't have to fill them in manually:

func NewHTTPProbeConfigV1Alpha1(name string) *HTTPProbeConfigV1Alpha1 {
    return &HTTPProbeConfigV1Alpha1{
        Meta: meta.Meta{
            MetaKind:       HTTPProbeKind,
            MetaAPIVersion: "v1alpha1",
        },
        MetaName: name,
    }
}

The struct also implements NamedDocument and config.Document, which Talos requires of its config documents:

// Name implements config.NamedDocument - used to identify the probe by name
func (p *HTTPProbeConfigV1Alpha1) Name() string {
    return p.MetaName
}

// Clone implements config.Document - delegates to the generated DeepCopy
func (p *HTTPProbeConfigV1Alpha1) Clone() config.Document {
    return p.DeepCopy()
}

DeepCopy() is generated code (covered later).

It then implements the remaining methods needed to fully satisfy NetworkHTTPProbeConfig (Name() is already defined above):

func (p *HTTPProbeConfigV1Alpha1) URL() string { return p.HTTPEndpoint }

func (p *HTTPProbeConfigV1Alpha1) Timeout() time.Duration {
    if p.HTTPTimeout == 0 {
        return 10 * time.Second // default
    }
    return p.HTTPTimeout
}

Validation

Validation runs when the machine config is applied or patched, so the node rejects a bad document before it ever reaches the probe machinery:

func (p *HTTPProbeConfigV1Alpha1) Validate() error {
    var errs []error

    if p.MetaName == "" {
        errs = append(errs, errors.New("name is required"))
    }

    if p.HTTPEndpoint == "" {
        errs = append(errs, errors.New("url is required"))
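    } else {
        // url.Parse can fail and return a nil URL, so check the error before using u.
        u, err := url.Parse(p.HTTPEndpoint)
        if err != nil {
            errs = append(errs, fmt.Errorf("url is invalid: %w", err))
        } else if u.Scheme != "http" && u.Scheme != "https" {
            errs = append(errs, errors.New("url must have http or https scheme"))
        }
    }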

    if p.HTTPTimeout < 0 {
        errs = append(errs, errors.New("timeout must be non-negative"))
    }

    return errors.Join(errs...)
}

A user applying a document with a missing url or a non-HTTP scheme gets an error immediately at the CLI.

The Container - aggregating multiple documents

Container is the runtime representation of all config documents currently applied to a node. When a user applies multiple documents, they all live in the same container. A generic helper walks the document list and returns only those satisfying a given interface:

func findMatchingDocs[T any](documents []config.Document) []T {
    var result []T
    for _, doc := range documents {
        if c, ok := doc.(T); ok {
            result = append(result, c)
        }
    }
    return result
}

For probes the container exposes:

func (container *Container) NetworkProbeConfigs() []config.NetworkCommonProbeConfig {
    return findMatchingDocs[config.NetworkCommonProbeConfig](container.documents)
}

This returns every TCPProbeConfig and HTTPProbeConfig document in a single slice because both implement NetworkCommonProbeConfig. A user can have any number of probe documents of mixed types. The ProbeConfigController then type-switches on each one to decide whether to build a TCP or HTTP spec:

switch cfg := probeConfig.(type) {
case configconfig.NetworkTCPProbeConfig:
    spec.TCP = network.TCPProbeSpec{...}
case configconfig.NetworkHTTPProbeConfig:
    spec.HTTP = network.HTTPProbeSpec{...}
}

This is the handoff point from the config layer to the controller layer.
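The filter-then-switch handoff can be reproduced with toy types. The interfaces and document structs below are simplified stand-ins (hostnameDoc is just a placeholder for any non-probe document), not the Talos definitions.

```go
package main

import "fmt"

type Document interface{ Kind() string }

type CommonProbeConfig interface {
	Document
	Name() string
}

type TCPProbeConfig interface {
	CommonProbeConfig
	Endpoint() string
}

type HTTPProbeConfig interface {
	CommonProbeConfig
	URL() string
}

type tcpDoc struct{ name, endpoint string }

func (d tcpDoc) Kind() string     { return "TCPProbeConfig" }
func (d tcpDoc) Name() string     { return d.name }
func (d tcpDoc) Endpoint() string { return d.endpoint }

type httpDoc struct{ name, url string }

func (d httpDoc) Kind() string { return "HTTPProbeConfig" }
func (d httpDoc) Name() string { return d.name }
func (d httpDoc) URL() string  { return d.url }

// hostnameDoc is a placeholder for a document that is not a probe.
type hostnameDoc struct{}

func (hostnameDoc) Kind() string { return "HostnameConfig" }

// findMatchingDocs keeps only the documents that satisfy interface T.
func findMatchingDocs[T any](documents []Document) []T {
	var result []T

	for _, doc := range documents {
		if c, ok := doc.(T); ok {
			result = append(result, c)
		}
	}

	return result
}

// describe picks the probe-specific fields via a type switch on interfaces.
func describe(cfg CommonProbeConfig) string {
	switch cfg := cfg.(type) {
	case TCPProbeConfig:
		return "tcp:" + cfg.Endpoint()
	case HTTPProbeConfig:
		return "http:" + cfg.URL()
	default:
		panic(fmt.Sprintf("unsupported probe config type: %T", cfg))
	}
}

func main() {
	docs := []Document{
		hostnameDoc{},
		tcpDoc{name: "proxy-check", endpoint: "proxy.example.com:3128"},
		httpDoc{name: "my-probe", url: "https://example.com"},
	}

	for _, p := range findMatchingDocs[CommonProbeConfig](docs) {
		fmt.Println(p.Name(), "->", describe(p))
	}
}
```

The non-probe document is dropped by the generic filter, and the switch never needs to know the concrete struct types.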

Code generation for the config layer

pkg/machinery/config/types/network/network.go has //go:generate directives that run two tools over the config types:

  • docgen → generates network_doc.go - field descriptions used by talosctl docs and the website reference pages.

  • deep-copy → generates deep_copy.generated.go - DeepCopy() methods so Talos can safely clone config documents when merging or passing between controllers.

So we add http_probe.go, add HTTPProbeConfigV1Alpha1 to the deep-copy -type list, and run make generate and make docs.

Controllers

What is the controller layer?

Controllers are Talos's reconcile loop pattern. Each controller declares what resources it reads (inputs), what resources it writes (outputs), and runs an event-driven loop reacting to changes.

Once a validated HTTPProbeConfig document is stored in the config container, the controller layer takes over. There are three controllers involved, each with a clear responsibility:

ProbeConfigController - the bridge from config to COSI

This controller is the bridge between the config layer and the COSI resource world. It reads the MachineConfig resource and writes ProbeSpec resources into the network-config namespace.

Input:  MachineConfig resource  (config namespace)
Output: ProbeSpec resources     (network-config namespace)

When Run() fires it:

  1. Reads the MachineConfig resource from COSI

  2. Extracts all probe configs via cfg.Config().NetworkProbeConfigs()

  3. For each probe config, builds a ProbeSpecSpec with the right fields

  4. Upserts it into COSI (create if new, update if changed)

  5. Deletes any ProbeSpec in network-config no longer present in config (StartTrackingOutputs + CleanupOutputs handles this)

The only change we made to this file was adding one case to the type switch in parseMachineConfiguration:

internal/app/machined/pkg/controllers/network/probe_config.go

Before:

switch probeConfig := probeConfig.(type) {
case configconfig.NetworkTCPProbeConfig:
    spec.TCP = network.TCPProbeSpec{
        Endpoint: probeConfig.Endpoint(),
        Timeout:  probeConfig.Timeout(),
    }
default:
    panic(fmt.Sprintf("unsupported probe config type: %T", probeConfig))
    // ↑ would panic if it ever saw an HTTPProbeConfig
}

After:

switch probeConfig := probeConfig.(type) {
case configconfig.NetworkTCPProbeConfig:
    spec.TCP = network.TCPProbeSpec{
        Endpoint: probeConfig.Endpoint(),
        Timeout:  probeConfig.Timeout(),
    }
case configconfig.NetworkHTTPProbeConfig:   // ← our addition
    spec.HTTP = network.HTTPProbeSpec{
        URL:     probeConfig.URL(),
        Timeout: probeConfig.Timeout(),
    }
default:
    panic(fmt.Sprintf("unsupported probe config type: %T", probeConfig))
}

probeConfig arrives as the NetworkCommonProbeConfig interface. The type switch checks at runtime which probe interface the value satisfies and extracts the right fields. The default panic ensures that if someone adds a third probe type later and forgets to update this switch, the failure is loud and immediate rather than silently ignored.

After this runs, COSI state contains a ProbeSpec resource like:

ID:               "configuration/http:https://example.com"
Interval:         5s
FailureThreshold: 3
TCP:              {Endpoint:"", Timeout:0}
HTTP:             {URL:"https://example.com", Timeout:10s}
ConfigLayer:      MachineConfiguration

Only one of TCP or HTTP is ever populated; the zero value of the other is what the runner uses downstream to decide which probe to run.

ProbeMergeController - merging all sources

ProbeConfigController uses OutputShared, meaning it is not the only possible source of ProbeSpec resources in network-config. ProbeMergeController reconciles all the sources.

Input:  ProbeSpec resources  (network-config namespace — all sources)
Output: ProbeSpec resources  (network namespace — final resolved)

It sorts specs by ConfigLayer priority, then builds the final map:

func NewProbeMergeController() controller.Controller {
    return GenericMergeController(
        network.ConfigNamespaceName,
        network.NamespaceName,
        func(logger *zap.Logger, list safe.List[*network.ProbeSpec]) map[resource.ID]*network.ProbeSpecSpec {
            list.SortFunc(func(left, right *network.ProbeSpec) int {
                return cmp.Compare(left.TypedSpec().ConfigLayer, right.TypedSpec().ConfigLayer)
            })

            probes := make(map[string]*network.ProbeSpecSpec, list.Len())
            for probe := range list.All() {
                id, _ := probe.TypedSpec().ID()
                probes[id] = probe.TypedSpec()
            }
            return probes
        },
    )
}

It's built on GenericMergeController, a shared pattern used across Talos network controllers (addresses, routes, and DNS resolvers all have equivalent *_merge.go files). We made zero changes here: it works generically over ProbeSpecSpec, so HTTP probes flow through automatically.

ProbeController - specs to goroutines

This controller watches the final resolved ProbeSpec resources in the network namespace and manages a map of probe.Runner goroutines, one per probe. We made zero changes here either.

Input:  ProbeSpec resources   (network namespace)
Output: ProbeStatus resources (network namespace)

Run() listens on two channels:

for {
    select {
    case <-r.EventCh():
        ctrl.reconcileRunners(...)   // sync running goroutines with desired ProbeSpecs
    case ev := <-notifyCh:
        ctrl.reconcileOutputs(...)   // write ProbeStatus from runner result
    }
}

reconcileRunners diffs desired state against currently running goroutines:

// Stop runners that no longer have a matching ProbeSpec
for id := range ctrl.runners {
    if _, exists := shouldRun[id]; !exists {
        ctrl.runners[id].Stop()
        delete(ctrl.runners, id)
    } else if !shouldRun[id].Equal(ctrl.runners[id].Spec) {
        ctrl.runners[id].Stop()  // spec changed — restart with new spec
        delete(ctrl.runners, id)
    }
}

// Start runners for new ProbeSpecs
for id := range shouldRun {
    if _, exists := ctrl.runners[id]; !exists {
        ctrl.runners[id] = &probe.Runner{
            ID:   id,             // "http:https://example.com"
            Spec: shouldRun[id], // full ProbeSpecSpec — controller doesn't inspect TCP vs HTTP
        }
        ctrl.runners[id].Start(ctx, notifyCh, logger)
    }
}

The controller passes the entire ProbeSpecSpec to the runner; it doesn't care which fields are set. That's the runner's responsibility.

When a runner sends a result back through notifyCh, reconcileOutputs writes it as a ProbeStatus resource:

safe.WriterModify(ctx, r, network.NewProbeStatus(network.NamespaceName, ev.ID),
    func(status *network.ProbeStatus) error {
        *status.TypedSpec() = ev.Status  // {Success: true/false, LastError: "..."}
        return nil
    })

This is what talosctl get probestatuses reads.

The actual HTTP connection is made by the runner in Layer 4.

Layer 4 - The Probe Runner

This is the goroutine that actually does the network check. ProbeController (Layer 3) creates one Runner per ProbeSpec and calls Start().
File: internal/app/machined/pkg/controllers/network/internal/probe/probe.go

The Runner type

type Runner struct {
    ID   string                  // e.g. "http:https://example.com"
    Spec network.ProbeSpecSpec   // full spec including TCP and HTTP fields

    cancel context.CancelFunc
    wg     sync.WaitGroup
}

Start(ctx, notifyCh, logger) spawns a goroutine running run().
Stop() cancels the context and waits for the goroutine to exit via wg.Wait().

The Run loop

func (runner *Runner) run(ctx context.Context, notifyCh chan<- Notification, logger *zap.Logger) {
    ticker := time.NewTicker(runner.Spec.Interval)
    defer ticker.Stop()

    consecutiveFailures := 0
    firstIteration := true

    for {
        if !firstIteration {
            select {
            case <-ctx.Done():
                return
            case <-ticker.C:
            }
        }
        firstIteration = false

        err := runner.probe(ctx)
        if err == nil {
            consecutiveFailures = 0
            notifyCh <- Notification{ID: runner.ID, Status: network.ProbeStatusSpec{Success: true}}
            continue
        }

        consecutiveFailures++
        if consecutiveFailures < runner.Spec.FailureThreshold {
            continue // not failed enough yet — stay silent
        }

        notifyCh <- Notification{
            ID:     runner.ID,
            Status: network.ProbeStatusSpec{Success: false, LastError: err.Error()},
        }
    }
}

Three things worth noting:

  • First iteration runs immediately: no waiting for the first tick. A newly configured probe yields a result as fast as possible.

  • FailureThreshold debouncing: a single failed attempt doesn't flip the status. Only after FailureThreshold consecutive failures does the runner emit a Success: false notification. This prevents flapping on transient blips.

  • One success resets the counter: recovery is immediate, and a single passing probe undoes any number of accumulated failures.

The runner pushes results into notifyCh, which ProbeController reads to update the ProbeStatus resource (Layer 3).

The Dispatcher

func (runner *Runner) probe(ctx context.Context) error {
    switch {
    case runner.Spec.TCP != (network.TCPProbeSpec{}):
        return runner.probeTCP(ctx)
    case runner.Spec.HTTP != (network.HTTPProbeSpec{}): // ← our addition
        return runner.probeHTTP(ctx)
    default:
        return errors.New("no probe type specified")
    }
}

The dispatch simply checks which field is non-zero. No type enum is needed: only one of TCP or HTTP is ever populated (Layer 3 guarantees this), and the zero value of a struct is a reliable discriminator. Adding HTTP support was just one extra case.

probeHTTP - the actual HTTP request

probeHTTP builds a fresh http.Client on every call: no shared state, no connection pool reuse across probe iterations.

func (runner *Runner) probeHTTP(ctx context.Context) error {
    client := &http.Client{
        Transport: httpdefaults.PatchTransport(cleanhttp.DefaultTransport()),
        CheckRedirect: func(*http.Request, []*http.Request) error {
            return http.ErrUseLastResponse
        },
    }

    ctx, cancel := context.WithTimeout(ctx, runner.Spec.HTTP.Timeout)
    defer cancel()

    req, err := http.NewRequestWithContext(ctx, http.MethodGet, runner.Spec.HTTP.URL.String(), nil)
    if err != nil {
        return err
    }

    resp, err := client.Do(req)
    if err != nil {
        return err // connection refused, TLS error, DNS fail, timeout
    }

    if resp.StatusCode < http.StatusOK || resp.StatusCode >= http.StatusBadRequest {
        return errors.New("received non-success status code: " + resp.Status)
    }

    return resp.Body.Close()
}

Four deliberate design choices:

1. httpdefaults.PatchTransport for proxy support. This sets transport.Proxy = httpproxy.FromEnvironment(). The key word is FromEnvironment(): proxy environment variables are re-read on every request, not cached at startup. So a probe targeting an endpoint behind a proxy starts working as soon as the proxy env var is set, with no restart needed.

2. cleanhttp.DefaultTransport() instead of &http.Transport{}. cleanhttp provides a transport that doesn't share state with Go's package-level http.DefaultTransport. Each runner gets isolated connection pools, so one probe can't accidentally affect another (poisoned keep-alive connections, exhausted dial slots, etc.).

3. CheckRedirect: http.ErrUseLastResponse. Redirects are not followed. If the server responds with a 301, we evaluate that response as successful: we're checking this endpoint's reachability, not chasing what it points to.

4. Status code policy: 2xx and 3xx are success, 4xx and 5xx are failure. This was a refinement after the initial implementation. The earlier version treated any HTTP response as success (the philosophy being "reachability, not correctness"), but in practice a 503 from a load balancer is a useful failure signal: the LB is up, but the backend isn't serving. The current rule:

  • < 200: shouldn't happen but treated as failure

  • 200–399: success (OK, Created, NoContent, MovedPermanently, Found)

  • >= 400: failure (BadRequest, Unauthorized, NotFound, InternalServerError)

The timeout is a context.WithTimeout wrapping the parent context, set from runner.Spec.HTTP.Timeout. The deadline is enforced across the entire round trip: DNS, TLS handshake, waiting for the first response byte, all of it.

We also call resp.Body.Close() on the success path. We don't read the body, since only the status line matters, but the body must be closed to release the underlying connection. When client.Do returns an error there is nothing left to close: the response is nil, or, in the CheckRedirect failure case, its body has already been closed.
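The status code policy and the no-redirect behaviour are easy to check against a throwaway local server. The following is a minimal sketch, not the Talos implementation: probeOnce and probeStatus are hypothetical names, it uses a plain http.Client instead of the patched cleanhttp transport, and it closes the body with defer so that the non-success branch releases the connection too.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"net/http"
	"net/http/httptest"
	"time"
)

// probeOnce is a simplified stand-in for probeHTTP. Assumption: a plain
// http.Client is used here instead of the patched cleanhttp transport.
func probeOnce(ctx context.Context, url string, timeout time.Duration) error {
	client := &http.Client{
		// Evaluate the first response instead of following redirects.
		CheckRedirect: func(*http.Request, []*http.Request) error {
			return http.ErrUseLastResponse
		},
	}

	ctx, cancel := context.WithTimeout(ctx, timeout)
	defer cancel()

	req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
	if err != nil {
		return err
	}

	resp, err := client.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close() // runs on both the success and the failure path

	if resp.StatusCode < http.StatusOK || resp.StatusCode >= http.StatusBadRequest {
		return errors.New("received non-success status code: " + resp.Status)
	}

	return nil
}

// probeStatus starts a local test server that always answers with the given
// status code and probes it once.
func probeStatus(code int) error {
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, _ *http.Request) {
		if code >= 300 && code < 400 {
			// The target is deliberately unreachable: the probe must not follow it.
			w.Header().Set("Location", "http://127.0.0.1:1/unreachable")
		}

		w.WriteHeader(code)
	}))
	defer srv.Close()

	return probeOnce(context.Background(), srv.URL, 2*time.Second)
}

func main() {
	for _, code := range []int{200, 301, 404, 503} {
		fmt.Println(code, probeStatus(code))
	}
}
```

The 301 case is the interesting one: the Location header points at an address nothing listens on, so the probe only passes because the redirect is evaluated rather than followed.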

Layer 5 - Integration Test

File: internal/integration/api/probe-config.go

Integration tests run against a real Talos cluster, verifying the full end-to-end flow. TestHTTPProbeConfig covers three things:

  1. Config → Resource: Applying an HTTPProbeConfig document creates a ProbeSpec with the HTTP fields populated

    suite.PatchMachineConfig(nodeCtx, probeConfig)
    
    rtestutils.AssertResource(nodeCtx, suite.T(), suite.Client.COSI, "http:"+probeURL,
        func(spec *networkres.ProbeSpec, asrt *assert.Assertions) {
            asrt.Equal(probeURL, spec.TypedSpec().HTTP.URL.String())
            asrt.Equal(5*time.Second, spec.TypedSpec().HTTP.Timeout)
        },
    )
    
  2. Runner executes: After a short wait, a ProbeStatus resource exists with success/failure data

    time.Sleep(3 * time.Second)
    // ProbeStatus.Success == true, or LastError is set — either proves the runner fired
    suite.Assert().True(status.TypedSpec().Success || status.TypedSpec().LastError != "")
    
  3. Cleanup works: Removing the config document removes the resource

    suite.RemoveMachineConfigDocuments(nodeCtx, network.HTTPProbeKind)
    rtestutils.AssertNoResource[*networkres.ProbeSpec](nodeCtx, suite.T(), suite.Client.COSI, "http:"+probeURL)
    

One detail worth noting: the tests are skipped in airgapped mode (suite.Airgapped), since https://example.com requires outbound internet access. This is the same pattern used across other network-dependent integration tests in Talos.

Putting it all together

Here is exactly what happens, in order, when a user applies an HTTPProbeConfig:

Step 1 - User applies YAML

apiVersion: v1alpha1
kind: HTTPProbeConfig
name: http-check
interval: 2s
failureThreshold: 3
url: https://example.com
timeout: 5s

Apply it:

talosctl patch mc --patch @http-probe.yaml

talosctl patch mc sends the patch over the gRPC API to machined. The config subsystem parses the document into HTTPProbeConfigV1Alpha1 and stores it inside a MachineConfig COSI resource:

// http_probe.go
type HTTPProbeConfigV1Alpha1 struct {
    meta.Meta         // kind: HTTPProbeConfig, apiVersion: v1alpha1
    MetaName string   // "http-check"
    CommonProbeConfig // interval: 2s, failureThreshold: 3
    HTTPEndpoint string       // "https://example.com"
    HTTPTimeout  time.Duration // 5s
}

This document is held inside a Container (which can hold many documents at once). The container is wrapped into a MachineConfig COSI resource and written into the "config" namespace with ID "v1alpha1":

Namespace: "config"
Type:      "MachineConfigs.config.talos.dev"
ID:        "v1alpha1"
Spec:      MachineConfig{ Container{ ...HTTPProbeConfigV1Alpha1... } }

Step 2 - ProbeConfigController wakes up

When MachineConfig changes, COSI fires r.EventCh() in ProbeConfigController. The controller reads the active config:

cfg, err := safe.ReaderGetByID[*config.MachineConfig](ctx, r, config.ActiveID)

Then calls:

probeConfigs := cfg.Config().NetworkProbeConfigs()

This returns all TCP and HTTP probe documents as []NetworkCommonProbeConfig (see the Container section in Layer 2). Our HTTPProbeConfigV1Alpha1 is in that slice; the type switch in Step 3 handles it from there.

Step 3 - ProbeConfigController translates config into a ProbeSpecSpec

// probe_config.go
for _, probeConfig := range probeConfigs {
    spec := network.ProbeSpecSpec{
        Interval:         probeConfig.Interval(),         // 2s
        FailureThreshold: probeConfig.FailureThreshold(), // 3
        ConfigLayer:      network.ConfigMachineConfiguration,
    }

    switch probeConfig := probeConfig.(type) {
    case configconfig.NetworkTCPProbeConfig:
        spec.TCP = network.TCPProbeSpec{...}
    case configconfig.NetworkHTTPProbeConfig:          // ← our new case
        spec.HTTP = network.HTTPProbeSpec{
            URL:     probeConfig.URL(),     // "https://example.com"
            Timeout: probeConfig.Timeout(), // 5s
        }
    }

    specs = append(specs, spec)
}

The type switch narrows from NetworkCommonProbeConfig to NetworkHTTPProbeConfig. This is how the controller learns it's an HTTP probe and fills the HTTP field rather than the TCP field.
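The narrowing step is plain Go interface embedding plus a type switch. Here is a minimal, self-contained sketch of the idea; commonProbe, tcpProbe, httpProbe, and the two document types are hypothetical, trimmed-down stand-ins for the Talos config interfaces.

```go
package main

import (
	"fmt"
	"time"
)

// Hypothetical, trimmed-down versions of the config interfaces.
type commonProbe interface {
	Interval() time.Duration
}

type tcpProbe interface {
	commonProbe
	Endpoint() string
}

type httpProbe interface {
	commonProbe
	URL() string
}

type tcpDoc struct{ endpoint string }

func (d tcpDoc) Interval() time.Duration { return time.Second }
func (d tcpDoc) Endpoint() string        { return d.endpoint }

type httpDoc struct{ url string }

func (d httpDoc) Interval() time.Duration { return 2 * time.Second }
func (d httpDoc) URL() string             { return d.url }

// describe narrows the common interface to the probe-specific one and reads
// the fields that only exist on that narrower interface.
func describe(p commonProbe) string {
	switch p := p.(type) {
	case tcpProbe:
		return "tcp:" + p.Endpoint()
	case httpProbe:
		return "http:" + p.URL()
	default:
		return "unknown"
	}
}

func main() {
	docs := []commonProbe{
		tcpDoc{endpoint: "172.20.0.2:6443"},
		httpDoc{url: "https://example.com"},
	}

	for _, d := range docs {
		fmt.Println(describe(d), d.Interval())
	}
}
```

Shared fields such as the interval stay on the common interface, so the code before the switch never needs to know which probe type it is handling.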

Step 4 - ProbeConfigController writes a ProbeSpec

The ProbeSpec is written to the network-config namespace:

Namespace: "network-config"
Type:      "ProbeSpecs.net.talos.dev"
ID:        "configuration/http:https://example.com"
Spec:      ProbeSpecSpec{
    Interval: 2s, FailureThreshold: 3,
    HTTP: HTTPProbeSpec{URL: "https://example.com", Timeout: 5s},
    ConfigLayer: ConfigMachineConfiguration,
}

Step 5 - ProbeController wakes up

ProbeController watches ProbeSpecType in the network namespace. In Talos, a merge controller actually copies specs from network-config to network. Once that copy happens, ProbeController's r.EventCh() fires.

It lists all ProbeSpecs:

specList, _ := safe.ReaderListAll[*network.ProbeSpec](ctx, r)

It sees "http:https://example.com" is in the list but has no Runner yet. So it starts one:

ctrl.runners["http:https://example.com"] = &probe.Runner{
    ID:   "http:https://example.com",
    Spec: ProbeSpecSpec{HTTP: HTTPProbeSpec{URL: "https://example.com", Timeout: 5s}, ...},
}
ctrl.runners["http:https://example.com"].Start(ctx, notifyCh, logger)
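The start-if-missing step above, together with the matching stop when a spec disappears, can be sketched as a small reconcile function. This is a minimal sketch and not the Talos controller: manager, runner, startRunner, and reconcile are hypothetical names, and the goroutine only waits for cancellation instead of probing.

```go
package main

import (
	"context"
	"fmt"
	"sort"
	"sync"
)

// runner is a hypothetical stand-in for probe.Runner: a stoppable goroutine.
type runner struct {
	cancel context.CancelFunc
	wg     sync.WaitGroup
}

func startRunner(ctx context.Context) *runner {
	ctx, cancel := context.WithCancel(ctx)
	r := &runner{cancel: cancel}

	r.wg.Add(1)

	go func() {
		defer r.wg.Done()
		<-ctx.Done() // a real runner would probe on a ticker here
	}()

	return r
}

// stop cancels the goroutine's context and waits for it to exit.
func (r *runner) stop() {
	r.cancel()
	r.wg.Wait()
}

// manager keeps one runner per spec ID (hypothetical; for illustration only).
type manager struct {
	ctx     context.Context
	runners map[string]*runner
}

func newManager() *manager {
	return &manager{ctx: context.Background(), runners: map[string]*runner{}}
}

// reconcile starts runners for new spec IDs and stops runners whose spec is gone.
func (m *manager) reconcile(desired []string) {
	want := make(map[string]struct{}, len(desired))

	for _, id := range desired {
		want[id] = struct{}{}

		if _, running := m.runners[id]; !running {
			m.runners[id] = startRunner(m.ctx)
		}
	}

	for id, r := range m.runners {
		if _, ok := want[id]; !ok {
			r.stop()
			delete(m.runners, id)
		}
	}
}

// ids returns the IDs of all running runners in sorted order.
func (m *manager) ids() []string {
	out := make([]string, 0, len(m.runners))
	for id := range m.runners {
		out = append(out, id)
	}

	sort.Strings(out)

	return out
}

func main() {
	m := newManager()

	m.reconcile([]string{"tcp:172.20.0.2:6443", "http:https://example.com"})
	fmt.Println(m.ids())

	m.reconcile([]string{"tcp:172.20.0.2:6443"})
	fmt.Println(m.ids())
}
```

Because reconcile always works from the full list of desired IDs, calling it twice with the same list is a no-op, which is the property a controller woken by every event needs.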

Step 6 - Runner goroutine runs the probe

Every interval the runner calls probeHTTP() (see Layer 4 for the full implementation). On success, it sends:

notifyCh ← Notification{ ID: "http:https://example.com", Status: {Success: true} }

On failure, the FailureThreshold debounce applies (see Layer 4) before any Success: false is emitted.

Step 7 - Runner sends result back to ProbeController

ProbeController receives the notification and writes:

Namespace: "network"
Type:      "ProbeStatuses.net.talos.dev"
ID:        "http:https://example.com"
Spec:      ProbeStatusSpec{Success: true, LastError: ""}

Verify the resource was created:

talosctl get probespec
NAMESPACE   TYPE        ID                         VERSION
network     ProbeSpec   http:https://example.com   1

talosctl get probestatus
NAMESPACE   TYPE          ID                         VERSION   SUCCESS   LAST ERROR
network     ProbeStatus   http:https://example.com   3         true

What Happens on Deletion

When the user removes the HTTPProbeConfig document from the machine config:

  1. MachineConfig resource is updated (no more HTTP probe doc inside it)

  2. ProbeConfigController wakes up, reconciles, finds zero probeConfigs from NetworkProbeConfigs()

  3. It writes nothing to network-config this round

  4. CleanupOutputs fires: it sees that "configuration/http:https://example.com" exists but was NOT written this round, so it calls Destroy on it

  5. ProbeController wakes up, lists ProbeSpecs: "http:https://example.com" is gone

  6. It calls ctrl.runners["http:https://example.com"].Stop(), which cancels the runner goroutine's context and waits for it to exit

  7. It also destroys the ProbeStatus resource since there's no longer a matching spec

  8. talosctl get probestatuses shows nothing
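The "written this round or destroyed" rule from step 4 can be sketched with an in-memory map standing in for the COSI namespace. This is only an illustration of the pattern; store and reconcileOutputs are hypothetical names, not the COSI API.

```go
package main

import (
	"fmt"
	"sort"
)

// store is a hypothetical in-memory stand-in for a COSI namespace.
type store map[string]string

// reconcileOutputs writes every desired resource, then destroys any existing
// resource that was not written ("touched") during this round.
func reconcileOutputs(s store, desired map[string]string) (destroyed []string) {
	touched := make(map[string]struct{}, len(desired))

	for id, spec := range desired {
		s[id] = spec
		touched[id] = struct{}{}
	}

	for id := range s {
		if _, ok := touched[id]; !ok {
			delete(s, id)
			destroyed = append(destroyed, id)
		}
	}

	sort.Strings(destroyed)

	return destroyed
}

func main() {
	s := store{}
	id := "configuration/http:https://example.com"

	// Round 1: the probe document is present, so nothing is destroyed.
	fmt.Println(reconcileOutputs(s, map[string]string{id: "interval=2s"}))

	// Round 2: the document was removed, so the stale resource is destroyed.
	fmt.Println(reconcileOutputs(s, nil))
}
```

The controller never has to remember what it created earlier: anything in its output namespace that it did not write in the current round is stale by definition.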

Summary

Talos already had well-layered network probes: a config layer, a resource layer, three controllers, and a probe runner. TCP probes worked end to end. HTTP support was a matter of extending each layer by one piece, following the same pattern that TCP had established.

The layered pattern means new probe types are purely additive: you slot in at each layer without touching anything in between. What this implementation taught me is that in a well-structured codebase, most of the work is understanding where things belong, not writing the code itself.
