The deploy worked. 616 files, rsync, done. Then I tried to SSH back in to set up nginx, and hit three failures in a row.
The Sequence
| Step | What Happened | Outcome |
|---|---|---|
| 1 | Deploy via rsync | Success — 616 files transferred |
| 2 | SSH to configure nginx | Failed — key rejected |
| 3 | Retry SSH with variations | fail2ban triggered — IP banned |
| 4 | Check password manager | Found SSH key — labeled for a different provider |
| 5 | Try console login via hosting panel | Root password unknown |
| 6 | Reset root password via panel | Console access restored |
| 7 | Write nginx config | Failed — references files that don’t exist |
| 8 | Session abandoned | Too many compounding failures |
Three distinct failures. Each recovery attempt either failed or created a new problem.
Failure 1: The Key That Wasn’t Where I Thought
The SSH key lives in a password manager, not on disk. Fine — that’s actually the secure approach. But the entry was labeled “DigitalOcean” even though the server hasn’t been on DigitalOcean for months. I was looking for an entry matching the current hosting provider and couldn’t find one.
The key itself was correct. The label was wrong. A labeling error turned a working credential into an unfindable one at the worst possible moment.
Failure 2: The Lockout That Punished Troubleshooting
Each failed SSH attempt counts toward fail2ban’s threshold. I hit the limit while still figuring out Failure 1. The server locked my IP — the standard, correct response to multiple failed authentication attempts.
The security tool did exactly what it’s supposed to do. The problem is that the troubleshooting process for “why can’t I log in?” generates exactly the traffic pattern that triggers “you’re attacking me.”
A VPN overlay network eventually solved this — it routes through a different IP, bypassing the ban entirely. But I didn’t try that until the evening resolution session.
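With console access (step 6 in the table), the ban itself can be confirmed and lifted on the server. A minimal sketch, assuming the common `sshd` jail name; the log line below is fabricated in fail2ban's standard format, and the IP is a documentation address:

```shell
# On the server, fail2ban's log records every ban. With console access,
# the real commands are:
#   fail2ban-client status sshd            # list currently banned IPs
#   fail2ban-client set sshd unbanip <ip>  # lift a ban manually
# A fabricated sample line in the default fail2ban log format:
log_sample='2024-03-01 18:42:07,119 fail2ban.actions [812]: NOTICE [sshd] Ban 203.0.113.5'

# Extract the banned IP from a "Ban" line; the same pattern works on
# the real /var/log/fail2ban.log:
banned_ip=$(echo "$log_sample" | grep -oE 'Ban [0-9.]+' | cut -d' ' -f2)
echo "banned: $banned_ip"   # → banned: 203.0.113.5
```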
Failure 3: The Config That Assumed a Template
With console access restored, I wrote an nginx config for the new subdomain. It failed validation. The SSL line referenced options-ssl-nginx.conf — the standard certbot include path that appears in every nginx tutorial.
This server doesn’t use that file. It uses snippets/ssl-params.conf. Also, the SSL certificate for subdomains uses a different path than the base domain — a wildcard cert at a -0001 suffix, not the base cert.
I’d written a config from memory instead of reading what was already working on the server.
| What I Wrote | What the Server Actually Uses |
|---|---|
| options-ssl-nginx.conf | snippets/ssl-params.conf |
| Base domain cert path | Wildcard cert path (-0001 suffix) |
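The evening fix (deriving SSL directives from a working config instead of from memory) can be sketched in shell. The file name, domain, and cert paths below are illustrative stand-ins, with the `-0001` suffix mirroring this server's wildcard-cert layout; on the real server the source file would live under `/etc/nginx/sites-enabled/`:

```shell
# Instead of writing SSL directives from memory, pull them verbatim from
# a site that already works. Simulated here with a local sample config.
cat > working-site.conf <<'EOF'
server {
    listen 443 ssl;
    server_name example.com;
    ssl_certificate /etc/letsencrypt/live/example.com-0001/fullchain.pem;
    ssl_certificate_key /etc/letsencrypt/live/example.com-0001/privkey.pem;
    include snippets/ssl-params.conf;
}
EOF

# Grab every SSL-related directive and include line as-is:
grep -E 'ssl_|include' working-site.conf
```

Running the new config through `nginx -t` before reloading catches any remaining template assumptions.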
The Resolution
Not that session. I abandoned it after the nginx failure — too many compounding failures, diminishing returns on each attempt.
Came back at 11:16 PM:
- Fixed nginx — read an existing working site config first, derived SSL directives from that instead of from memory
- Switched SSH alias to VPN IP — fail2ban only watches the public interface, VPN traffic bypasses it entirely
- Deleted stale password manager entry — the “DigitalOcean” label that caused the initial confusion
The site was live shortly after. What should have taken a few minutes after the deploy stretched across the rest of the evening — most of that time spent not working on it.
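The alias switch is an `~/.ssh/config` change. A sketch with a placeholder overlay IP; it writes to a local demo file rather than the real config so it stays self-contained:

```shell
# Point the SSH alias at the server's VPN overlay address so fail2ban,
# which only watches the public interface, never sees the traffic.
# The IP and key path are placeholders; the real edit targets
# ~/.ssh/config, written to a demo file here.
cat > ssh_config.demo <<'EOF'
Host server-alias
    HostName 100.64.0.2      # VPN overlay IP, not the public address
    User root
    IdentityFile ~/.ssh/id_ed25519
EOF

grep 'HostName' ssh_config.demo
```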
The Pre-Flight Protocol
Added this to the global instructions file the same night:
Before any task involving the VPS:
- Verify SSH works: `ssh server-alias "id"` — if this fails, stop
- Check password manager is unlocked
- For nginx changes — read an existing working config first, never use a template from memory
Three checks. Each one would have caught a different failure from this session before it cascaded into the next one.
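The checklist above can be sketched as a shell function. The `server-alias` name comes from the checklist itself; the password-manager and nginx checks are left as manual prompts because the tooling varies:

```shell
# Pre-flight checks before any VPS task. BatchMode prevents a hung
# password prompt; ConnectTimeout keeps the check fast.
preflight() {
    local host="${1:-server-alias}"

    # Check 1: SSH must work before anything else. If it fails, stop.
    if ! ssh -o BatchMode=yes -o ConnectTimeout=5 "$host" id >/dev/null 2>&1; then
        echo "preflight FAILED: cannot ssh to $host, stop here" >&2
        return 1
    fi

    # Check 2: password manager unlocked (manual confirmation).
    echo "confirm: password manager is unlocked"

    # Check 3: reminder for nginx work, read a working config first.
    echo "reminder: derive nginx changes from an existing working config"
}
```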
Why Cascades Are Worse Than Individual Failures
An individual failure has a clear fix. A cascade has compounding context — by the time you reach failure three, you’re frustrated from failures one and two, your mental model of the system is shaken, and you’re more likely to reach for shortcuts (like writing configs from memory instead of reading what works).
The six-hour wall-clock time wasn’t because the problems were hard. Each one was a 2-minute fix. The time was consumed by the emotional overhead of cascading failures and the decision to walk away — which turned out to be the most productive decision of the evening.
The Damage Report
| Metric | Value |
|---|---|
| Deploy time | ~2 minutes |
| Time locked out | Hours (mostly voluntary break) |
| Distinct failures | 3 |
| Recovery attempts that created new problems | 2 |
| Actual fix time (at 11:16 PM) | ~4 minutes |
| Prevention rules added | 3-step pre-flight protocol |
The Pattern
The cascade: each recovery attempt either fails or creates a new problem. Retrying SSH triggers fail2ban. Getting console access reveals the nginx config is wrong. Each step forward opens a new failure sideways.
The countermeasure is verification before action. Check that SSH works before deploying something that needs post-deploy configuration. Check that you can read before you try to write. The pre-flight exists because the cost of three checks is trivial compared to the cost of three compounding failures.
The other countermeasure is walking away. In the first session I was fighting the system. At 11:16 PM I was reading it. Same person, same problems, different approach.