Monitoring my Homelab, Simply

I have a middling self-hosted/homelab setup, and it occasionally breaks.

Alas, no monitoring tool has ever sparked joy in me.

Don’t get me wrong, I understand that they’re essential for large fleets of services with fast-changing software and teams of oncallers working around the clock to understand the complex ways that complex systems fail… but my stuff doesn’t change that often, failures are mundane and low-scope, and I’m the only person coming to rescue this poor stopped systemd unit that failed to restart, or open the port that I accidentally blocked in an overzealous attempt to stop the barbarians.

So between no monitoring and Prometheus, I’ll choose no monitoring every day.

But there’s a spectrum.

What I want from a monitoring setup

Simple, by which I mean:

Out-of-scope:

Design

tl;dr I wrote a tiny program that regularly checks HTTP/DNS/etc., and pages me via https://ntfy.sh

I’ve been running this for 2-3 years now.

Show me the code

A prober probes at an interval, and has a description.

type prober interface {
    probe(context.Context) error
    interval() time.Duration
    String() string
}

Here’s an example prober for checking TLS connectivity and cert expiration:

type tlsProber struct {
    host string
    port int
    duration time.Duration
}

func (p *tlsProber) probe(ctx context.Context) error {
    target := fmt.Sprintf("%s:%d", p.host, p.port)
    conn, err := tls.Dial("tcp", target, &tls.Config{})
    if err != nil {
        return err
    }
    defer conn.Close()
    cs := conn.ConnectionState()
    if len(cs.PeerCertificates) == 0 {
        return fmt.Errorf("want >0 certs, got 0")
    }

    // Alert if the leaf certificate expires within the next two weeks.
    cert := cs.PeerCertificates[0]
    expiration := cert.NotAfter
    if diff := time.Until(expiration); diff < 14*24*time.Hour {
        return fmt.Errorf("cert expires in %v at %v", diff, expiration)
    }
    return nil
}

func (p *tlsProber) interval() time.Duration {
    return p.duration
}

func (p *tlsProber) String() string {
    return fmt.Sprintf("TLS %s:%d", p.host, p.port)
}

I have similar probers for HTTP, TCP, DNS and domain expiration, plus a decorator-style retryProber for when things are flaky.
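Roughly, the retryProber just wraps another prober and only reports an error once a few attempts in a row have failed. A sketch (the retry count and delay fields here are illustrative, not exactly what’s in my tree):

type retryProber struct {
    wrapped prober
    retries int           // e.g. 3
    delay   time.Duration // pause between attempts, e.g. 10 * time.Second
}

func (p *retryProber) probe(ctx context.Context) error {
    for attempt := 1; ; attempt++ {
        err := p.wrapped.probe(ctx)
        if err == nil {
            return nil
        }
        if attempt >= p.retries {
            return fmt.Errorf("still failing after %d attempts: %w", p.retries, err)
        }
        select {
        case <-time.After(p.delay): // brief pause before trying again
        case <-ctx.Done():
            return ctx.Err()
        }
    }
}

func (p *retryProber) interval() time.Duration { return p.wrapped.interval() }
func (p *retryProber) String() string          { return p.wrapped.String() }

Wrapping a flaky target is then just &retryProber{wrapped: &tlsProber{...}, retries: 3, delay: 10 * time.Second}.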

I run each prober in a loop, notifying once when it starts failing, with reminder notifications every hour. When the probe starts passing again, it notifies too.

const reminderInterval = 1*time.Hour

type state int
const (
    good state = iota
    bad
)

func run(ctx context.Context, p prober, n notifier) {
    defer panic("Unreachable. Don't stop ticking. Never return!")
    log.Printf("Runner started for prober %q with interval %v.", p.String(), p.interval())
    ticker := time.NewTicker(p.interval())
    reminderTicker := time.NewTicker(reminderInterval)

    s := good
    var lastErr error
    for {
        select {
        case <-ticker.C:
            // Jitter. Also, prevent crash loops from causing damage.
            time.Sleep(time.Duration(rand.Intn(5000))*time.Millisecond)
            log.Printf("Probing %s", p.String())
            err := p.probe(ctx)
            if err != nil {
                // bad
                if s == good {
                    if err := n.notify(ctx, fmt.Sprintf("%s: %v", p.String(), err)); err != nil {
                        log.Printf("Notify failed: %v", err)
                    }
                } else {
                    log.Printf("%s failed but not re-notifying: %v", p.String(), err)
                }
                s = bad
                lastErr = err
                continue
            }

            // good
            if s == bad {
                if err := n.notify(ctx, fmt.Sprintf("%s: now passing", p.String())); err != nil {
                    log.Printf("Notify failed: %v", err)
                }
                s = good
                lastErr = nil
            }
        case <-reminderTicker.C:
            log.Printf("Checking whether to ping a reminder for %s", p.String())
            if s != good {
                if err := n.notify(ctx, fmt.Sprintf("%s still failing: %v", p.String(), lastErr)); err != nil {
                    log.Printf("Notify failed: %v", err)
                }
            }
        }
    }
}

If I were to review my own code, I’d say s should be named prevState, and // bad needs more clarification. But this is my code and my blog, so LGTM!

I’ve left out some boring bits like main and the notifier, but I will mention that the probers are defined in code:

var mythicBeastsNS = []string{
    "ns1.mythic-beasts.com",
    "ns2.mythic-beasts.com",
}

var probers = []prober{
    &tlsProber{host: "b.tuxes.uk", port: 443, duration: 5*time.Minute},
    &tcpProber{host: "my-vps.example.com", port: 22, duration: 5*time.Minute},
    &domainProber{domain: "tuxes.uk", rdapURL: "https://rdap.nominet.uk/uk/domain/tuxes.uk", expectedNS: mythicBeastsNS},
}

This saves me from dealing with parsing and config files, and gives me the flexibility of code.
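The notifier interface the run loop expects is tiny, and ntfy.sh accepts a plain HTTP POST of the message body to a topic URL, so a minimal sketch (using net/http and strings; the topic name here is made up) looks something like:

type notifier interface {
    notify(ctx context.Context, msg string) error
}

// ntfyNotifier publishes by POSTing the message body to a topic URL,
// e.g. https://ntfy.sh/some-hard-to-guess-topic (made up for this post).
type ntfyNotifier struct {
    topicURL string
}

func (n *ntfyNotifier) notify(ctx context.Context, msg string) error {
    req, err := http.NewRequestWithContext(ctx, http.MethodPost, n.topicURL, strings.NewReader(msg))
    if err != nil {
        return err
    }
    resp, err := http.DefaultClient.Do(req)
    if err != nil {
        return err
    }
    defer resp.Body.Close()
    if resp.StatusCode != http.StatusOK {
        return fmt.Errorf("ntfy: want 200, got %d", resp.StatusCode)
    }
    return nil
}

main then just starts a runner per prober and blocks forever:

func main() {
    ctx := context.Background()
    n := &ntfyNotifier{topicURL: "https://ntfy.sh/some-hard-to-guess-topic"}
    for _, p := range probers {
        go run(ctx, p, n)
    }
    select {} // the runners never return
}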

Deployment

This can run ~anywhere. It doesn’t listen, so no need to put it somewhere that can be scraped.

This happens to run as a DynamicUser daemon on my router: my most reliable machine; a dependency for fixing anything else anyway; and something I’ll notice if it’s down.
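The unit is nothing fancy; roughly this shape (path and names illustrative):

[Unit]
Description=Homelab prober
After=network-online.target
Wants=network-online.target

[Service]
ExecStart=/usr/local/bin/prober
DynamicUser=yes
Restart=always
RestartSec=10s

[Install]
WantedBy=multi-user.target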

Receiving pages

I have https://ntfy.sh installed on my phone to receive pages. It’s respectful of battery, and notifications can be silenced for n hours.

I can’t silence particular pages, so every silence is “global”. This is fine; I just need to know “I should fix a thing”. A global silence could mask growing problems. Meh. It’ll do.

How to monitor the monitor?

If my daemon crashes, or I accidentally drop my computer in a stew, how long will it be until I notice I have no visibility?

To avoid this, I ping https://healthchecks.io as a dead man’s switch. If it hasn’t received a ping in the last 10 minutes, it notifies me via https://ntfy.sh and email.

There’s a subtlety here. If my program reliably crashes after 6 minutes of runtime (e.g. due to a memory leak, or a panic that only happens on the 2nd probe), but still manages to ping healthchecks.io every 5 minutes, then my dead man’s switch will be happy, but my less frequent probers will never run.

So I have two dead man’s switches: a shortwave one probed every few minutes to catch the machine being down outright, and a longwave one probed on a much longer interval to catch the binary crash looping.

Those are also just HTTP probers:

var probers = []prober{
    // ...
    
    // Shortwave ping that is meant to quickly detect if this server is down.
    &httpProber{url: "https://hc-ping.com/79bfdda2-b162-451c-bc33-58d925240434", expected: 200, duration: 5*time.Minute},

    // Longwave ping that is meant to detect if this binary is crash looping.
    &httpProber{url: "https://hc-ping.com/05bdb3cc-4a2f-48dd-9a24-b08977e7f6a9", expected: 200, duration: 2*reminderInterval},
}

I can’t remember why I decided that the longwave probe should have a period of 2*reminderInterval, but I remember thinking hard about it, and thinking I was quite the scholar for thinking hard about this. If you figure it out and I’m correct, then please let me know.
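For reference, httpProber isn’t shown above; consistent with the url/expected/duration fields it’s used with, a sketch could be:

type httpProber struct {
    url      string
    expected int // expected HTTP status code
    duration time.Duration
}

func (p *httpProber) probe(ctx context.Context) error {
    req, err := http.NewRequestWithContext(ctx, http.MethodGet, p.url, nil)
    if err != nil {
        return err
    }
    resp, err := http.DefaultClient.Do(req)
    if err != nil {
        return err
    }
    defer resp.Body.Close()
    if resp.StatusCode != p.expected {
        return fmt.Errorf("want status %d, got %d", p.expected, resp.StatusCode)
    }
    return nil
}

func (p *httpProber) interval() time.Duration { return p.duration }
func (p *httpProber) String() string          { return fmt.Sprintf("HTTP %s", p.url) }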

What I like about it

Limitations

Alternatives

Probing is nothing new. There are tons of alternatives.

No monitoring

I could just wait for me or my family to notice that stuff is broken. It’s a reliable way to learn about the stuff that actually needs fixing. But I’d like to spare us that inconvenience, and get some notice so I can fix things. Besides, I’m usually the cause of breakages, and if I learn shortly after breaking something, I’m usually in a good spot to fix it.

updown.io

A great site. Sadly it doesn’t have that many probers, and can’t see within my mesh network.

Uptime Kuma

Pro: Looks nice. Off-the-shelf.

Con: Difficult to extend. Persistent state.

Prometheus and Blackbox exporter

Nooooo