When the fault is in the star (topology)

It has been a gruelling few weeks in the lab, a troubleshooting marathon that has thoroughly tested the patience of both the amateur scientist and the practicing engineer in me. There is nothing quite as maddening as an intermittent fault in a system designed for high availability. My Proxmox cluster, specifically node pve3, decided to start haunting me with random, catastrophic isolations. One moment everything was green and quorate; the next, pve3 was fenced off, isolated, and seemingly dead to the world. The only way to bring it back was a hard power cycle—a brute force solution that felt like a defeat every single time.

My initial hypothesis was load-based, which seemed logical at the time. I blamed the night sync scripts saturating the link, or the massive I/O hit from the nightly Proxmox Backup Server snapshot. It made sense: if the pipes are clogged, the heartbeats can’t get through. I scrutinized the downloaders and stared at bandwidth graphs, convinced that I was just asking too much of the hardware. But the data refused to fit the narrative. The logs were maddeningly contradictory. Corosync would scream about token timeouts, complaining that it hadn’t heard from its peers in over 20 seconds, an eternity in cluster time. Yet when I dove into the host’s kernel logs, they were pristine: no driver crashes, no e1000e errors, and crucially, absolutely no link-down events. It was a “silent black hole.” The OS believed it was connected, the switch statistics showed zero packet errors, yet packets were simply vanishing into the ether.
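For the record, this is roughly the cross-checking I was doing on pve3 while chasing that contradiction (the interface name eno1 below is just an example, not necessarily my actual NIC):

```
# Corosync's side of the story: token timeouts and membership changes
journalctl -u corosync --since "2 hours ago"

# The kernel's side: driver errors or link flaps (there were never any)
dmesg -T | grep -iE "e1000e|link"

# NIC-level counters on the cluster interface
ethtool -S eno1 | grep -iE "err|drop"
```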

I tried architectural mitigations first, assuming that maybe my network was just a bit jittery. I edited the config to increase the Corosync token timeout to a generous 20,000ms. I was effectively telling the cluster, “It’s okay, take your time, don’t panic if a packet is late.” It didn’t matter. The universe mocked me; the node crashed again with an outage lasting exactly 21.3 seconds—just barely, but decisively, beating my new tolerance window. Mitigation had failed, so I pivoted to forensic isolation. I needed to know if the problem was the workload or the node. I migrated the heavy VM workloads, including the backup server, over to pve1 to rule out I/O starvation on pve3. I waited. pve3 crashed anyway. Crucially, it crashed at 23:38—well outside the backup window. That was the smoking gun: it wasn’t the software workload. The node was dying while idle.
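In case anyone wants to try the same mitigation: on Proxmox the token timeout lives in the totem section of /etc/pve/corosync.conf, and the edit only propagates once config_version is bumped. The excerpt below is a sketch with placeholder values, not my exact file:

```
totem {
  cluster_name: homelab     # placeholder cluster name
  config_version: 12        # bump this so the change propagates to all nodes
  version: 2
  token: 20000              # token timeout in milliseconds (the default is far lower)
}
```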

I went deeper, tearing apart the network stack. I found a nasty mismatch of RSTP and MSTP protocols across my Netgear and MikroTik switches, a definite misconfiguration, but even fixing that didn’t stop the rot. Not that I thought it was relevant, but you never know! Anyway, I was left with the impossible: a healthy link that refused to forward traffic. There was only one variable left. I walked over to the rack and physically swapped the node from port g4 to port g1 on the switch. Since that moment, I’ve had 11 days of uninterrupted uptime. Statistically, that’s about as close to definitive proof as I’m going to get: port g4 has some subtle, microscopic silicon logic fault that doesn’t trigger error counters but drops packets when it feels like it. The scientist in me hates that I can’t spawn a parallel universe to prove the counterfactual, to know for sure it wouldn’t have crashed anyway. But the practicing engineer knows when to take “yes” for an answer. The cluster is stable, the media is moving, and port g4 is administratively disabled and marked “FAULTY” with a label maker. I’ll take the win.
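If you’re fighting something similar, the checks I keep running to convince myself it really is fixed are the same boring ones as before (the exact output format varies a bit between corosync versions):

```
# Quorum and membership as Proxmox sees them
pvecm status

# Per-link health as corosync itself sees it
corosync-cfgtool -s
```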
