The deploy worked. 616 files, rsync, done. Then I tried to SSH back in to set up nginx, and hit three failures in a row.
The Sequence
| Step | What Happened | Outcome |
|---|---|---|
| 1 | Deploy via rsync | Success — 616 files transferred |
| 2 | SSH to configure nginx | Failed — key rejected |
| 3 | Retry SSH with variations | fail2ban triggered — IP banned |
| 4 | Check password manager | Found SSH key — labeled for a different provider |
| 5 | Try console login via hosting panel | Root password unknown |
| 6 | Reset root password via panel | Console access restored |
| 7 | Write nginx config | Failed — references files that don’t exist |
| 8 | Session abandoned | Too many compounding failures |
Three distinct failures. Each recovery attempt either failed or created a new problem.
Failure 1: The Key That Wasn’t Where I Thought
The SSH key lives in a password manager, not on disk. Fine — that’s actually the secure approach. But the entry was labeled “DigitalOcean” even though the server hasn’t been on DigitalOcean for months. I was looking for an entry matching the current hosting provider and couldn’t find one.
The key itself was correct. The label was wrong. A labeling error turned a working credential into an unfindable one at the worst possible moment.
Failure 2: The Lockout That Punished Troubleshooting
Each failed SSH attempt counts toward fail2ban’s threshold. I hit the limit while still figuring out Failure 1. The server locked my IP — the standard, correct response to multiple failed authentication attempts.
The security tool did exactly what it’s supposed to do. The problem is that the troubleshooting process for “why can’t I log in?” generates exactly the traffic pattern that triggers “you’re attacking me.”
A VPN overlay network eventually solved this — it routes through a different IP, bypassing the ban entirely. But I didn’t try that until the evening resolution session.
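With console access (step 6 in the table), the ban itself can be confirmed and lifted on the server. A minimal sketch, assuming the common `sshd` jail name; the log line below is fabricated in fail2ban's standard format, and the IP is a documentation address:

```shell
# On the server, fail2ban's log records every ban. With console access,
# the real commands are:
#   fail2ban-client status sshd            # list currently banned IPs
#   fail2ban-client set sshd unbanip <ip>  # lift a ban manually
# A fabricated sample line in the default fail2ban log format:
log_sample='2024-03-01 18:42:07,119 fail2ban.actions [812]: NOTICE [sshd] Ban 203.0.113.5'

# Extract the banned IP from a "Ban" line; the same pattern works on
# the real /var/log/fail2ban.log:
banned_ip=$(echo "$log_sample" | grep -oE 'Ban [0-9.]+' | cut -d' ' -f2)
echo "banned: $banned_ip"   # → banned: 203.0.113.5
```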
Failure 3: The Config That Assumed a Template
With console access restored, I wrote an nginx config for the new subdomain. It failed validation. The SSL line referenced options-ssl-nginx.conf — the standard certbot include path that appears in every nginx tutorial.
This server doesn’t use that file. It uses snippets/ssl-params.conf. Also, the SSL certificate for subdomains uses a different path than the base domain — a wildcard cert at a -0001 suffix, not the base cert.
I’d written a config from memory instead of reading what was already working on the server.
| What I Wrote | What the Server Actually Uses |
|---|---|
| options-ssl-nginx.conf | snippets/ssl-params.conf |
| Base domain cert path | Wildcard cert path (-0001 suffix) |
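The evening fix (deriving SSL directives from a working config instead of from memory) can be sketched in shell. The file name, domain, and cert paths below are illustrative stand-ins, with the `-0001` suffix mirroring this server's wildcard-cert layout; on the real server the source file would live under `/etc/nginx/sites-enabled/`:

```shell
# Instead of writing SSL directives from memory, pull them verbatim from
# a site that already works. Simulated here with a local sample config.
cat > working-site.conf <<'EOF'
server {
    listen 443 ssl;
    server_name example.com;
    ssl_certificate /etc/letsencrypt/live/example.com-0001/fullchain.pem;
    ssl_certificate_key /etc/letsencrypt/live/example.com-0001/privkey.pem;
    include snippets/ssl-params.conf;
}
EOF

# Grab every SSL-related directive and include line as-is:
grep -E 'ssl_|include' working-site.conf
```

Running the new config through `nginx -t` before reloading catches any remaining template assumptions.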
The Resolution
Not that session. I abandoned it after the nginx failure — too many compounding failures, diminishing returns on each attempt.
Came back at 11:16 PM:
- Fixed nginx — read an existing working site config first, derived SSL directives from that instead of from memory
- Switched SSH alias to VPN IP — fail2ban only watches the public interface, VPN traffic bypasses it entirely
- Deleted stale password manager entry — the “DigitalOcean” label that caused the initial confusion
The site was live shortly after. What should have taken a few minutes after the deploy stretched across the rest of the evening — most of that time spent not working on it.
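The alias switch is an `~/.ssh/config` change. A sketch with a placeholder overlay IP; it writes to a local demo file rather than the real config so it stays self-contained:

```shell
# Point the SSH alias at the server's VPN overlay address so fail2ban,
# which only watches the public interface, never sees the traffic.
# The IP and key path are placeholders; the real edit targets
# ~/.ssh/config, written to a demo file here.
cat > ssh_config.demo <<'EOF'
Host server-alias
    HostName 100.64.0.2      # VPN overlay IP, not the public address
    User root
    IdentityFile ~/.ssh/id_ed25519
EOF

grep 'HostName' ssh_config.demo
```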
The Pre-Flight Protocol
Added this to the global instructions file the same night:
Before any task involving the VPS:
- Verify SSH works: `ssh server-alias "id"` — if this fails, stop
- Check password manager is unlocked
- For nginx changes — read an existing working config first, never use a template from memory
Three checks. Each one would have caught a different failure from this session before it cascaded into the next one.
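The checklist above can be sketched as a shell function. The `server-alias` name comes from the checklist itself; the password-manager and nginx checks are left as manual prompts because the tooling varies:

```shell
# Pre-flight checks before any VPS task. BatchMode prevents a hung
# password prompt; ConnectTimeout keeps the check fast.
preflight() {
    local host="${1:-server-alias}"

    # Check 1: SSH must work before anything else. If it fails, stop.
    if ! ssh -o BatchMode=yes -o ConnectTimeout=5 "$host" id >/dev/null 2>&1; then
        echo "preflight FAILED: cannot ssh to $host, stop here" >&2
        return 1
    fi

    # Check 2: password manager unlocked (manual confirmation).
    echo "confirm: password manager is unlocked"

    # Check 3: reminder for nginx work, read a working config first.
    echo "reminder: derive nginx changes from an existing working config"
}
```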
Why Cascades Are Worse Than Individual Failures
An individual failure has a clear fix. A cascade has compounding context — by the time you reach failure three, you’re frustrated from failures one and two, your mental model of the system is shaken, and you’re more likely to reach for shortcuts (like writing configs from memory instead of reading what works).
The six-hour wall-clock time wasn’t because the problems were hard. Each one was a 2-minute fix. The time was consumed by the emotional overhead of cascading failures and the decision to walk away — which turned out to be the most productive decision of the evening.
The Damage Report
| Metric | Value |
|---|---|
| Deploy time | ~2 minutes |
| Time locked out | Hours (mostly voluntary break) |
| Distinct failures | 3 |
| Recovery attempts that created new problems | 2 |
| Actual fix time (at 11:16 PM) | ~4 minutes |
| Prevention rules added | 3-step pre-flight protocol |
The Pattern
The cascade: each recovery attempt either fails or creates a new problem. Retrying SSH triggers fail2ban. Getting console access reveals the nginx config is wrong. Each step forward opens a new failure sideways.
The countermeasure is verification before action. Check that SSH works before deploying something that needs post-deploy configuration. Check that you can read before you try to write. The pre-flight exists because the cost of three checks is trivial compared to the cost of three compounding failures.
The other countermeasure is walking away. In the first session I was fighting the system. At 11:16 PM I was reading it. Same person, same problems, different approach.