Second Brain Chronicles

Deploy Succeeded. Locked Out Ten Minutes Later.

The deploy worked. 616 files, rsync, done. Then I tried to SSH back in to set up nginx, and hit three failures in a row.

The Sequence

| Step | What Happened | Outcome |
|---|---|---|
| 1 | Deploy via rsync | Success — 616 files transferred |
| 2 | SSH to configure nginx | Failed — key rejected |
| 3 | Retry SSH with variations | fail2ban triggered — IP banned |
| 4 | Check password manager | Found SSH key — labeled for a different provider |
| 5 | Try console login via hosting panel | Root password unknown |
| 6 | Reset root password via panel | Console access restored |
| 7 | Write nginx config | Failed — references files that don’t exist |
| 8 | Session abandoned | Too many compounding failures |

Three distinct failures. Each recovery attempt either failed or created a new problem.

Failure 1: The Key That Wasn’t Where I Thought

The SSH key lives in a password manager, not on disk. Fine — that’s actually the secure approach. But the entry was labeled “DigitalOcean” even though the server hasn’t been on DigitalOcean for months. I was looking for an entry matching the current hosting provider and couldn’t find one.

The key itself was correct. The label was wrong. A labeling error turned a working credential into an unfindable one at the worst possible moment.

Failure 2: The Lockout That Punished Troubleshooting

Each failed SSH attempt counts toward fail2ban’s threshold. I hit the limit while still figuring out Failure 1. The server locked my IP — the standard, correct response to multiple failed authentication attempts.

The security tool did exactly what it’s supposed to do. The problem is that the troubleshooting process for “why can’t I log in?” generates exactly the traffic pattern that triggers “you’re attacking me.”

A VPN overlay network later saved this — it routes through a different IP, bypassing the ban entirely. But I didn’t try that until the evening resolution session.
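For context, fail2ban's SSH jail boils down to a retry threshold inside a counting window. A typical `jail.local` looks roughly like this — all values here are illustrative, not this server's actual settings, and the `ignoreip` line shows how a VPN overlay range could be exempted up front:

```ini
# /etc/fail2ban/jail.local -- illustrative values only
[sshd]
enabled  = true
maxretry = 5              ; failed attempts before a ban
findtime = 10m            ; window in which retries are counted
bantime  = 1h             ; how long the offending IP stays banned
; Exempting a VPN overlay range (placeholder CGNAT range) means
; troubleshooting traffic over the VPN can never trip the jail.
ignoreip = 127.0.0.1/8 100.64.0.0/10
```

The tension in the post is visible in the numbers: five "variations" of a key problem is exactly the traffic pattern `maxretry` exists to catch.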

Failure 3: The Config That Assumed a Template

With console access restored, I wrote an nginx config for the new subdomain. It failed validation. The SSL line referenced options-ssl-nginx.conf — the standard certbot include path that appears in every nginx tutorial.

This server doesn’t use that file. It uses snippets/ssl-params.conf. Also, the SSL certificate for subdomains uses a different path than the base domain — a wildcard cert at a -0001 suffix, not the base cert.

I’d written a config from memory instead of reading what was already working on the server.

| What I Wrote | What the Server Actually Uses |
|---|---|
| options-ssl-nginx.conf | snippets/ssl-params.conf |
| Base domain cert path | Wildcard cert path (-0001 suffix) |
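Put together, the corrected server block looks roughly like this. The domain, root, and exact cert paths are placeholders — only the `snippets/ssl-params.conf` include and the `-0001` cert lineage are details taken from this session:

```nginx
# Illustrative nginx server block for the new subdomain.
# example.com and filesystem paths are placeholders.
server {
    listen 443 ssl;
    server_name sub.example.com;

    # The wildcard cert lives under a -0001 lineage directory,
    # not the base domain's directory.
    ssl_certificate     /etc/letsencrypt/live/example.com-0001/fullchain.pem;
    ssl_certificate_key /etc/letsencrypt/live/example.com-0001/privkey.pem;

    # This server's SSL params include -- not the tutorial-standard
    # options-ssl-nginx.conf.
    include snippets/ssl-params.conf;

    root  /var/www/sub;
    index index.html;
}
```

Both corrections came from reading a working site config, which is the point of the evening fix below.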

The Resolution

Not that session. I abandoned it after the nginx failure — too many compounding failures, diminishing returns on each attempt.

Came back at 11:16 PM:

  1. Fixed nginx — read an existing working site config first, derived SSL directives from that instead of from memory
  2. Switched SSH alias to VPN IP — fail2ban only watches the public interface, VPN traffic bypasses it entirely
  3. Deleted stale password manager entry — the “DigitalOcean” label that caused the initial confusion
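Step 2 amounts to a one-line change in `~/.ssh/config`. The alias, user, and address here are placeholders — `100.64.0.2` assumes a Tailscale-style CGNAT overlay range:

```
Host server-alias
    # VPN overlay address instead of the public IP -- fail2ban only
    # watches the public interface, so it never sees this traffic.
    HostName 100.64.0.2
    User deploy
    IdentityFile ~/.ssh/id_ed25519
```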

The site was live shortly after. What should have taken a few minutes after the deploy stretched across the rest of the evening — most of that time spent not working on it.

The Pre-Flight Protocol

Added this to the global instructions file the same night:

Before any task involving the VPS:

  1. Verify SSH works: ssh server-alias "id" — if this fails, stop
  2. Check password manager is unlocked
  3. For nginx changes — read an existing working config first, never use a template from memory

Three checks. Each one would have caught a different failure from this session before it cascaded into the next one.
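As a script, the protocol might look like this — a sketch, not the actual instructions file. `server-alias` is the alias from step 1, the SSH command is overridable so the check can be exercised without a server, and steps 2 and 3 are left as prompts because they depend on tools outside the script:

```shell
#!/usr/bin/env bash
# Pre-flight sketch. SSH_CMD is overridable for local testing;
# the real invocation would be "ssh server-alias".
SSH_CMD="${SSH_CMD:-ssh server-alias}"

preflight() {
  # 1. Verify SSH works before doing anything that needs post-deploy
  #    access. If this fails, stop -- before fail2ban stops you.
  if ! $SSH_CMD "id" >/dev/null 2>&1; then
    echo "pre-flight: SSH check failed -- stop here" >&2
    return 1
  fi
  # Steps 2-3 are manual: unlocking the password manager and reading
  # an existing working nginx config depend on the specific tools.
  echo "pre-flight: SSH ok. Confirm password manager is unlocked."
  echo "pre-flight: for nginx changes, read a working config first."
  return 0
}
```

Failing fast on the SSH check is the key design choice: it converts "three failed variations and a ban" into a single failed probe.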

Why cascades are worse than individual failures

An individual failure has a clear fix. A cascade has compounding context — by the time you reach failure three, you’re frustrated from failures one and two, your mental model of the system is shaken, and you’re more likely to reach for shortcuts (like writing configs from memory instead of reading what works).

The six-hour wall-clock time wasn’t because the problems were hard. Each one was a 2-minute fix. The time was consumed by the emotional overhead of cascading failures and the decision to walk away — which turned out to be the most productive decision of the evening.

The Damage Report

| Metric | Value |
|---|---|
| Deploy time | ~2 minutes |
| Time locked out | Hours (mostly voluntary break) |
| Distinct failures | 3 |
| Recovery attempts that created new problems | 2 |
| Actual fix time (at 11:16 PM) | ~4 minutes |
| Prevention rules added | 3-step pre-flight protocol |

The Pattern

The cascade: each recovery attempt either fails or creates a new problem. Retrying SSH triggers fail2ban. Getting console access reveals the nginx config is wrong. Each step forward opens a new failure sideways.

The countermeasure is verification before action. Check that SSH works before deploying something that needs post-deploy configuration. Check that you can read before you try to write. The pre-flight exists because the cost of three checks is trivial compared to the cost of three compounding failures.

The other countermeasure is walking away. In the first session I was fighting the system. At 11:16 PM I was reading it. Same person, same problems, different approach.

