Gluecode Journal

Consider No

Wed, 13 May 2026 01:00:00 +0000

I know I've written a lot, at length, about politics and lessons learned. My most recent post is inherently political.

I thought maybe white supremacy would be extinguished. I thought extensive antitrust enforcement would change company behavior. I thought maybe, just maybe, customer sentiment would get companies to back off on shoving more ads, tracking, and genAI down our throats. I thought maybe a collective upset would make companies see we don't want the death of generalized, cheap computing.

None of that seems to have happened, or worked.

None of this is a matter of progression, or evolution, or people learning. The information is there. People were loud about it. The people whose minds or behavior we keep thinking we can change don't change.

We keep giving chances, and exceptions, and trust to those who have literally abused every opportunity they've been given. Companies, republicans, anyone with power -- they take the stance of "fuck you, I do what I want."

So what are we to do?

At some point, any person needs to be told "no." If corporations-are-people (ugh), then we need to tell corporations, government, anybody who's starting down the path of ruining what's good for the rest of us the full and complete sentence:

No.

How to handle AI inputs as a culture

Mon, 20 Apr 2026 01:00:00 +0000

As much as I hate generative-LLM output in systems, and how quickly it's being pushed as the cure-all for business problems, I have to admit that a broken clock can be "right" every so often. LLM outputs are rapidly becoming as useful as boilerplate generators, or fuzzers, or the RAD IDE environments.

How we as a society are reacting to "AI" outputs reminded me of the immune response of the human body -- namely, some of us are okay with it (largely because the use case doesn't matter) and some of us are VERY MUCH NOT OKAY with it, to the point where we have violent reactions to its existence in our spaces.

Generative LLM = Virus

As one of my favorite authors put it:

Imagine that you encounter a signal. It is structured, and dense with information. It meets all the criteria of an intelligent transmission. Evolution and experience offer a variety of paths to follow, branch-points in the flowcharts that handle such input. Sometimes these signals come from conspecifics who have useful information to share, whose lives you'll defend according to the rules of kin selection. Sometimes they come from competitors or predators or other inimical entities that must be avoided or destroyed; in those cases, the information may prove of significant tactical value. Some signals may even arise from entities which, while not kin, can still serve as allies or symbionts in mutually beneficial pursuits. You can derive appropriate responses for any of these eventualities, and many others.

You decode the signals, and stumble

[...]

There are no meaningful translations for these terms. They are needlessly recursive. They contain no usable intelligence, yet they are structured intelligently; there is no chance they could have arisen by chance.

The only explanation is that something has coded nonsense in a way that poses as a useful message; only after wasting time and effort does the deception become apparent. The signal functions to consume the resources of a recipient for zero payoff and reduced fitness. The signal is a virus.

The signal is an attack.

We aren't wrong to treat the invasion of plausible-looking bad data like we would any virus. The problem is that our "programming immune systems" haven't had to deal with such an onslaught of plausible-but-bad data before. As Paracelsus said, "the dose makes the poison."

So, extending the metaphor of the human body, let's look at how our bodies have actually dealt with previous invasions, specifically of retroviruses -- the kind that want to persist their sequences into our bodies and proliferate forever.

Retroviral Response

A Survey of Human Genes of Retroviral Origin shows us that ~8% of our tissue sequences have retroviral components. Eight percent of anything isn't all that much, but what the document shows is that the sequences are not evenly distributed. Places like the placenta, the testes, or even the skin would have 10-100x the retroviral sequences of tissues like the thymus or the bone marrow. And there were almost no sequences to be found in the heart or the liver.

This tells us that, where having some unexpected coding sequences or errors is low-risk or even expected (the placenta has to deal with a literal foreign-body with its own DNA for 40 weeks), introduction of viral material isn't all that important. In some cases, it might have even been helpful to the reproductive outcomes of our species.

Conversely, when you have a very critical system like the liver or the heart, nothing can be tweaked or modified or replaced. Every part of that system needs to function as-intended, or the host dies.

The human body has responded and set up a procedure to fight the invasion of viral content -- terminate the invaders, sure, but also, leave the battlefield alone when it's done.

We're going to have some volume of AI-generated content in our software, in our libraries, our videos, etc, because it's not critical for the things we do to be 100% perfect every time. If we treat the presence of AI content as a "kill it with fire" response every time, we'll end up in the equivalent of a cytokine storm and burn ourselves out (or remove ourselves entirely) just trying to fight it.

If only we'd fought the nazis with such fervor.

Citations for relevant papers:

de Parseval N, Lazar V, Casella JF, Benit L, Heidmann T. Survey of human genes of retroviral origin: identification and transcriptome of the genes with coding capacity for complete envelope proteins. J Virol. 2003 Oct;77(19):10414-22. doi: 10.1128/jvi.77.19.10414-10422.2003. PMID: 12970426; PMCID: PMC228468.

Chuong EB. The placenta goes viral: Retroviruses control gene expression in pregnancy. PLoS Biol. 2018 Oct 9;16(10):e3000028. doi: 10.1371/journal.pbio.3000028. PMID: 30300353; PMCID: PMC6177113.

Neovim and the Terminal

Mon, 16 Feb 2026 01:00:00 +0000

I've been using "alternative Terminal emulators" on MacOS for... quite a while. It's one of the few things I must do for Neovim with all its colors and syntax highlighting, because otherwise why would I have a true-color scheme and all that syntax highlighting?

The reality is I don't use a lot of special colors in my editor anyway. My default .vimrc even turns off syntax highlighting and colors. So now I've got a LazyVim-enabled setup with true-color using Ghostty as my emulator and... I do all that for a cursor-color change when in various modes. I even set up terminal color schemes so that it would take advantage of the terminal-provided palette.

I don't think MacOS terminal is actually unable to do any of this, since I'm able to "see" the cursor with alternating contrast colors in native vim... so what am I missing?

Turns out, it was my neovim colorscheme. I had already switched over MacOS to the "Developer Colors" palette, but I was still seeing "wrong" colors inside neovim. By switching to the quiet colorscheme instead of minischeme, I'm now able to get nice, terminal-color-highlighted selections and alternating cursor colors.

Solution

New MacOS Terminal Profile --> Change color palette to "Developer"
Neovim --> use the colorscheme "quiet"
Do a happy dance

How DHCP nerfed my NIC

Sat, 14 Feb 2026 01:00:00 +0000

Bought a new 10GbE-capable NIC using the Intel X550 chipset. I had purchased one previously, but that was for when I had an Eero Pro unit with a 10Gbit port on it.

Now that I've "downgraded" my network to a 2.5gbit switch, I needed a unit that could do the intermediate steps between gigabit and 10GbE. And yes, I've heard much moaning about how the 2.5 and 5GbE are just hobbled steps on the way to full 10GbE -- doesn't matter, it's what we have and we need it to work at line speed.

Problem is, with the interface left on DHCP, link autonegotiation only settles on 10GbE or 1GbE; nothing in between will autonegotiate. Turns out this is by design!

 * 2.5G and 5G autonegotiation speeds on X550
 * are disabled by default due to reported
 * interoperability issues with some switches.

How do I fix it?

First, accept that you're gonna have to let go of DHCP for a minute. If you can't do without DHCP on your device, that is... for a future post! Yeah! >_>

My card is an X550-T2, meaning it has two ports, ix0 and ix1, so I have to double up my commands here. The active port I selected is ix1, so the ix1 lines are the absolute must-do ones.

First we have to add some overrides to the loader.conf because the ix driver does not allow changes to advertised speeds after boot.

[/boot/loader.conf]

if_ix_load="YES"
hw.ix.0.advertise_speed=0x10
hw.ix.1.advertise_speed=0x10
dev.ix.0.advertise_speed=0x10
dev.ix.1.advertise_speed=0x10

The 0x10 is special because (as said before) those speeds are disabled by default:

/************************************************************************
 * ixgbe_get_default_advertise - Get default advertised speed settings
 *
 *   Formatted for sysctl usage.
 *   Flags:
 *     0x1 - advertise 100 Mb
 *     0x2 - advertise 1G
 *     0x4 - advertise 10G
 *     0x8 - advertise 10 Mb (yes, Mb)
 *     0x10 - advertise 2.5G (disabled by default)
 *     0x20 - advertise 5G (disabled by default)
 ************************************************************************/
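The flag values in that comment read as a bitmask, so (assuming the driver accepts combined values the way the "Flags" header suggests) you could advertise several speeds at once instead of only 2.5G. A minimal sketch of the arithmetic; the ADV_* names and the advertise_mask helper are made up for illustration and are not part of the driver:

```sh
#!/bin/sh
# Flag values copied from the ixgbe comment above.
# The variable names are hypothetical, chosen for this sketch.
ADV_100M=0x1
ADV_1G=0x2
ADV_10G=0x4
ADV_10M=0x8
ADV_2_5G=0x10
ADV_5G=0x20

# advertise_mask FLAG...: OR the given flags together and print the result as hex.
advertise_mask() {
    mask=0
    for flag in "$@"; do
        mask=$(( mask | $flag ))
    done
    printf '0x%x\n' "$mask"
}

advertise_mask "$ADV_2_5G"                        # prints 0x10
advertise_mask "$ADV_1G" "$ADV_2_5G" "$ADV_10G"   # prints 0x16
```

Only 0x10 by itself appears in the config above; combined values such as 0x16 are untested here.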

Then we reboot. This should have let the DHCP-configured interface autonegotiate to a 2.5Gbit line speed, but it didn't. I went through a few more options and finally said "I must tell the machine what to do."

Thus I changed the following lines in my rc.conf:

[/etc/rc.conf]

#ifconfig_ix1="DHCP"
ifconfig_ix1="inet xxx.xxx.xxx.xxx netmask 0xffffff00 media 2500Base-T"
defaultrouter="xxx.xxx.xxx.xxx"

Now it works! And it's pretty snappy. Don't forget to add the defaultrouter line or you won't be able to reach anything.
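In case the 0xffffff00 in that ifconfig line looks odd: it is the hex spelling of 255.255.255.0, which is a /24. A small sketch to double-check a hex netmask before putting it in rc.conf; prefix_len is a hypothetical helper, and it assumes the mask is contiguous (it simply counts the set bits):

```sh
#!/bin/sh
# prefix_len HEXMASK: count the set bits in a netmask such as 0xffffff00.
# Assumes a contiguous mask; this is a sanity check, not a validator.
prefix_len() {
    mask=$(( $1 ))
    bits=0
    while [ "$mask" -ne 0 ]; do
        bits=$(( bits + (mask & 1) ))
        mask=$(( mask >> 1 ))
    done
    echo "$bits"
}

prefix_len 0xffffff00   # prints 24
```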

GoToSocial instance shut down

Tue, 27 Jan 2026 01:00:00 +0000

For reasons I still don't understand, my GoToSocial instance (at gts.gluecode.net) stopped receiving updates from federated instances. This is now the second time it's happened and I still can't sort out why.

As such, as of today I am reverting my account from wavefunction@gts.gluecode.net to wavefunction@mastodon.sdf.org.

I'm rather upset. This was supposed to be significantly easier and we've made it difficult. Again.

I'll set up a native host in the future. Until then, I leave the social hosting to someone else.

[ Update ] Surprising no one, GTS has a problem redirecting accounts away to the new instance. Cool cool. Time to just shut it the hell down.

We made some mistakes

Wed, 14 Jan 2026 01:00:00 +0000

There's so much that's gone wrong with our (currently unfunded, non-functioning) federal government and its policies, but "when we get back to normal" there's still a lot of work to do.

We, the People, have made some serious mistakes. If we want to fix them, there are clear paths to addressing them, and the only question is "how fast"? *

Considering the reality of our legislative environment, we probably need to move as fast as possible, with the fewest possible means of removing said changes to prevent assholes from fucking it all up.

This post kept getting longer and longer... it's time to publish.

Procedural Changes

Let's start with a few things:

Voting

In an extension of the constitutional amendments, prohibit all "tests" for polling places. The maximum set of requirements is to show some form of ID the first time you go to vote. (As an example, here's California's set of requirements.) Otherwise, no restrictions can be added.
Prohibit gerrymandering using validation mechanisms already tested and verified by subject experts (not politicians or judges).
While the Federal government can't force states to change, push for ranked-choice voting.
- Enable ranked-choice voting within federal systems where possible.

Judicial System

Staff 23 justices in SCOTUS. Pack them with progressives (the kind that make socialists squee) but also throw in some others. Do judicial rotations instead of en-banc panels. No first-time judges.
Set term limits for SCOTUS. (18 years seems like a good number.)
Enforceable ethics rules for circuit judges, etc must also apply to SCOTUS, with full legal consequences.
Staff the 40+ vacant federal judgeships and add another 400 necessary to process the backlog of asylum and immigration cases. (Also, to do more judging! We need courts to get to [necessary] hearings faster.)
- Absent the above change, open up our borders.
Reinterpret the second amendment such that individual gun ownership is only allowed with court order or restraining order ("proof of need").

Legislature

Eliminate Qualified Immunity at the federal level. It was a bad decision in 1967, still a bad decision today.
Expand the exemptions in anticircumvention law to cover all systems and software, or remove the prohibitions outright. That's Section 103 of the DMCA, codified as 17 U.S.C. § 1201. If someone can do the job better, let them hack the thing.
- More broadly, remove the DMCA in its entirety. Yes, it has been used to prevent removal of licenses for open source projects, but the harms are more numerous than the good.
Remove the safe-harbor provision of SEC Rule 10b-18. Make Stock Buybacks unambiguously illegal again.
Reinstate deference doctrine.
Failure to pass a budget results in an immediate election for the body that failed to authorize the budget.
Require all congress-persons and judges to put their stocks and retirement into a blind trust / management group while in office. Maybe spouses too.
Pass a universal healthcare / medicare expansion for all. ("Public option")
Pass a universal basic income system, with a wealth/asset tax to support it initially. ("Universal welfare"?)
Public Housing (not Section 8) with both initial investment and ongoing funding for maintenance. Raise the cost-cap to meet "modern" costs.

Executive Agencies (FCC, FTC, EPA, IRS, etc)

Work with the legislature to legally support and motivate the Agencies to investigate and punish-as-needed every single company over 100 people in size. Start with the biggest and push through verdicts (ideally with the enlarged staff of judges).
- IRS audits, particularly of individuals with high wealth.
- NLRB audits
- FCC review of telcos
- FTC and SEC investigations of every FinTech company
- EPA review of every datacenter and local manufacturing/power plant
Special note: leverage the Antitrust division to take on Sinclair Broadcasting Group and break up their ownership of local newsrooms across the US.

Observability is a scale problem

Wed, 07 Jan 2026 01:00:00 +0000

Warning: Hot Take

Observability (with "arbitrarily wide log events") is a solution to a problem that most fields and businesses don't have.

Measuring from First Principles

Most businesses are not pushing the boundaries of science or research. Most businesses are built to solve a particular problem and focus on providing services or products at a lower-cost or higher-satisfaction than their competitors.

This matters because it means the questions you will be asking of your data are already known or knowable before you begin.

When you provide a service to a customer, you're probably going to want to know about errors in service, latency or delays in processing, overall throughput of the service to customers, and your cost in resources to support it. There will be other additional questions you'll want to ask, but you'll want those baseline questions answered in a meaningful way before you find yourself able to answer higher-order questions.

This applies whether you're talking about a restaurant, a car manufacturer, or a software shop. Your key performance indicators (KPIs) are a well-known set of values and you don't have to look very far to apply them to your current process. That's not to say you aren't doing something unique in your business -- you probably are! But what you are not doing is something beyond the realm of process and procedure.

Why structured events though

Structured events allow for storing data with an often-arbitrary organization, of unlimited cardinality, and without necessarily needing to act on the input on receipt.

These are all wonderful properties. We like this; we want this; it is an unmitigated good for a business to manage its events in this way.

But... you said structured events are Bad

When looking at telemetry, you must split the relevant information into two categories: Business Events, and Operational Telemetry. The former is always high-value data and can/should be stored as structured events. Operational telemetry should Not.

Business events are things like "Person A bought X, Y, and Z" or "Customer R modified a deployment in zone Q" -- these are your business's operations, turned into Events. These Events are part of your business history and are always worth the time and money to retain.

What isn't worth retaining are the Operational events. The fact that a particular operation took 16ms instead of 23ms on the backend service is not something your business is going to care about months or years from now. The fact that a particular operation took three retries to complete is not something the business will care about in six weeks. If they do care about arbitrary questions about those particular operations, then they need to be willing to spend exponentially-increasing volumes of money to retain and process said data.

Separate your business and operational data, then what?

This will take time and effort, but once you do successfully categorize your data into "operational" and "business" buckets, you can treat them differently. When new services are built, they can cleanly flag events generated for each.

Business events get long, expensive storage options. They get schemas and requirements, with clear reasons for the data to exist.

Operational events get weeks or months of lifetime before being compacted or deleted.

No need for expensive historical Splunk queries to figure out why your service is slow -- collapse your data into point-metrics and do that. Even if cardinality is high, you're storing a small dataset and adding more points for tracking is a small addition to the set.
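To make "collapse your data into point-metrics" concrete, here is a rough sketch rather than a recommendation for any particular stack. It assumes a made-up log format of "epoch_seconds service latency_ms" per line, and the rollup function name is hypothetical. Each (minute, service) pair collapses to a single line carrying count, average, and max:

```sh
#!/bin/sh
# rollup: read "epoch_seconds service latency_ms" lines on stdin and print
# one line per (minute, service): minute_epoch service count avg_ms max_ms
# The input format is an assumption made for this sketch.
rollup() {
    awk '
    {
        minute = $1 - ($1 % 60)      # truncate the timestamp to the minute
        key = minute " " $2
        lat = $3 + 0                 # force a numeric comparison
        count[key]++
        sum[key] += lat
        if (!(key in max) || lat > max[key]) max[key] = lat
    }
    END {
        for (key in count)
            printf "%s %d %.1f %d\n", key, count[key], sum[key] / count[key], max[key]
    }' | LC_ALL=C sort
}

# Three raw events become two metric points.
printf '%s\n' \
    '1767747600 api 16' \
    '1767747630 api 23' \
    '1767747660 api 10' | rollup
```

The raw lines can then age out on the short operational retention window while the rolled-up points stay small enough to keep around.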

My dataset is already small...

Then literally anything will work for what you have. You don't need Honeycomb, Splunk, DataDog, or any of that -- you need grep and a handful of refurbished 26TB disk drives for storage.

GoToSocial v0.20

Tue, 23 Dec 2025 01:00:00 +0000

I want to run a fediverse server on my FreeBSD host. I don't want to bother setting up Jails and I don't really need to run ruby or nodejs on my server when ActivityPub is a standardized protocol. I decided to look around and found GoToSocial. It runs as a single Go binary and, if you don't want to install postgresql, can even host a low number of users on sqlite.

What's not to love?

Well, there are a very few things not to love. The most recent issue was that my sqlite-backed instance decided to stop updating / receiving updates from peer instances. I can't quite pin down why that was an issue except that, for some reason, the instance started declaring that my proxy no longer passed on the real IP address of my client to the instance.

As a bit of background, I had to change over my network hardware due to frequent restarts and failures with the Eero Pro kit. Now, I have a Dream Router 7, an 8-port 2.5Gbps switch, and new access points in the house. Because my switch is only a Layer-2 switch, I wouldn't expect it to be able to short-circuit the routing table and forward my request directly to the server... but the DR7 could. That is easily remedied by adding an exception for my local IPv4 network range, but I still find it weird that I had to do it in the first place.

Still, that did not fix my no-update problem, so I elected to migrate from SQLite to PostgreSQL as the backing database. At the very least, I could poke around in the DB without risking a strange multi-writer issue or some other problem. Setting it up was easy -- migrating my older data is what caused problems. The export/import function explicitly doesn't support moving over favorites, saved posts, and a half dozen other objects. Suffice to say, until GoToSocial actively supports export/import of those particular functions, I will recommend everyone just install and use postgres as the backing DB. It's not worth losing data to try sqlite.

Since I know Go, I should be able to write my own fixes and propose them in the main repo, so this is my way of complaining publicly while I look at what I can do to fix it privately. I won't be a hypocrite -- OpenSource means "you got fingers, you go fix it."

Big Drive Is Here

Wed, 22 Oct 2025 01:00:00 +0000

It came!

-- The certified refurbished Seagate Exos 24TB drive has arrived.

After doing a partition create on it, it showed the drive had only 22TB of counted, usable space:

$ gpart list ada1
...
first: 40
entries: 128
scheme: GPT
Providers:
1. Name: ada1p1
   Mediasize: 24000277209088 (22T)
   Sectorsize: 512
   Stripesize: 4096
   Stripeoffset: 0
...

and once I created a single-disk pool for it, that usable space dropped to 21.8TB:

$ zpool list bigdisk
NAME      SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
bigdisk  21.8T  2.47T  19.3T        -         -     0%    11%  1.00x    ONLINE  -
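Most of the "missing" space looks like a units mismatch rather than lost capacity: the drive label counts decimal terabytes (10^12 bytes), while the "T" in the tool output appears to be binary (2^40 bytes). A quick sketch of the arithmetic, using the Mediasize value from the listing above; to_tb and to_tib are hypothetical helper names:

```sh
#!/bin/sh
# to_tb BYTES: decimal terabytes (10^12 bytes), the unit on the drive label.
to_tb() {
    awk -v b="$1" 'BEGIN { printf "%.1f\n", b / 1e12 }'
}

# to_tib BYTES: binary tebibytes (2^40 bytes).
to_tib() {
    awk -v b="$1" 'BEGIN { printf "%.1f\n", b / (1024 ^ 4) }'
}

to_tb  24000277209088   # prints 24.0
to_tib 24000277209088   # prints 21.8
```

So the 24TB on the box and the 21.8T in the pool listing describe roughly the same number of bytes, and the "22T" from the partition listing is the same figure rounded.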

But... it's here, and populated with data I was carefully mirroring on external platter drives.

I think now I can start looking at what it takes to build my backup / offline service collection...

Mirroring zroot on FreeBSD 14

Sat, 18 Oct 2025 01:00:00 +0000

Followed the directions from: https://blog.frankleonhardt.com/2025/add-mirror-to-single-zfs-disk/ but I don't want to lose the notes, so I'm copying what I did here.

Move all data off the 4TB ssd (ada1)

Run the gpart function to move bootcode over:

gpart destroy -F ada1 ;
gpart backup ada0 | gpart restore ada1 ;
gpart bootcode -b /boot/pmbr -p /boot/gptzfsboot -i 1 ada1 ;

Destroyed the old "ssd" zpool, and then attached the new partition as a mirror:

zpool destroy ssd;
zpool attach zroot ada0p4 ada1p4;

Because the partition of the 1TB is not as large as the new 4TB drive, I waited until the resilvering completed, and then updated the end partition from 800GB -> 3.5TB.

gpart show ada1;

# Resize without a "size" flag takes all remaining space
gpart resize -i 4 ada1 ;

gpart show ada1;

Now I need to sort out how to break the mirror and use only the 4TB disk.

I can't facepalm hard enough

Turns out efiboot is a royal pain and for some reason kept referencing a device that didn't exist in /etc/fstab when I booted using the new drive.

To make edits to the root filesystem, I first had to make it writable in ZFS:

zfs set readonly=off zroot/ROOT/default

# Since we're here, let's make it autoexpand the pool
zpool set autoexpand=on zroot

For fstab, I had to change the values from:

# Device        Mountpoint  FStype  Options     Dump    Pass#
/dev/gpt/efiboot0    /boot/efi  msdosfs rw      2   2

into:

# Device        Mountpoint  FStype  Options     Dump    Pass#
/dev/ada0p1     /boot/efi   msdosfs rw      2   2

Once that was over, I was able to boot into a normal environment with all my usual stuff.

That led to me seeing the (original sized) 800GB zfs root partition:

$ zpool list
NAME      SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
archive  4.55T  2.44T  2.10T        -         -     0%    53%  1.00x    ONLINE  -
zroot    800G   29.8G  737G         -     ~2.7T     5%     9%  1.00x    ONLINE  -

What?! I thought this was supposed to automatically grow the space!

Running zpool online -e zroot ada0p4 expanded it, and then we have:

NAME      SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
archive  4.55T  2.44T  2.10T        -         -     0%    53%  1.00x    ONLINE  -
zroot    3.51T  80.8G  3.43T        -         -     1%     2%  1.00x    ONLINE  -

Successful generation of RSS in shell

Tue, 30 Sep 2025 01:00:00 +0000

Finally got RSS generation working in 79d0260, then c736893 added a pubDate tag to properly update/sort articles.

I got here after trying to shove yaml and metadata into pandoc and get it to generate the RSS xml. When I tried building a yaml file with the full text in the description, it definitely didn't work.

After trying to make the metadata file as json instead of yaml, then getting parser errors as I tried to put strings in the json file, I finally realized "This is XML. Why don't I just build it by hand?"

And that's where I got to. (Almost) Entirely shell constructing a clean RSS feed. (Pandoc was necessary to translate the markdown to html.)
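For anyone curious what "build it by hand" can look like, here is a minimal sketch. It is not the script from those commits; xml_escape, rss_item, and the example.com link are made up for illustration. The idea is to escape the handful of characters XML cares about and printf the rest:

```sh
#!/bin/sh
# xml_escape TEXT: escape the characters that are special in XML text.
# The ampersand has to be handled first so later escapes are not re-escaped.
xml_escape() {
    printf '%s' "$1" | sed -e 's/&/\&amp;/g' -e 's/</\&lt;/g' -e 's/>/\&gt;/g'
}

# rss_item TITLE LINK PUBDATE HTML_BODY: print one <item> element.
# PUBDATE is expected to already be in the RFC 822 style used by the feed.
rss_item() {
    printf '<item>\n'
    printf '  <title>%s</title>\n' "$(xml_escape "$1")"
    printf '  <link>%s</link>\n' "$(xml_escape "$2")"
    printf '  <pubDate>%s</pubDate>\n' "$3"
    printf '  <description>%s</description>\n' "$(xml_escape "$4")"
    printf '</item>\n'
}

# Hypothetical post; the HTML body would come from pandoc in practice.
rss_item 'Shell & RSS' \
    'https://example.com/2025/09/30/rss.html' \
    'Tue, 30 Sep 2025 01:00:00 +0000' \
    '<p>Finally got RSS generation working.</p>'
```

Escaping the HTML body this way keeps the full text inside the description without tripping the XML parser.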

But what of new smartphones?

Mon, 29 Sep 2025 01:00:00 +0000

Apple gave its big "here's all the new iPhone stuff we're releasing" announcement on September 10. Part of $DAYJOB is to watch those, and it had me thinking: what do I really do with my personal phone?

When I need a phone For Work Purposes, the iPhone stands out as the winner. A company can do literally anything to provision one of those highly-capable units and it will rise to the task every time. Fantastic performance, (usually) multi-day battery, and a plethora of accessories and compatible bits. If you hate tinkering, don't want to think about it too hard, and need something that Just Works for almost any occasion, iPhone is the answer.

My $DAYJOB-provided phone works spectacularly well. It has full Mobile Device Management (MDM) installed on it, a VPN the office manages, and plenty of "you will use only [these] apps" in the management panel. It carries my digital keys for accessing the office, provides me ~60% of the access and capability that my laptop would, and ensures I'm always reachable during my on-call rotations. (Yes, I'm on-call for things. That part kinda sucks.)

This is good news! Having a company-managed and company-provided handset keeps me from doing "personal stuff" on my work phone and work stuff on my could-be-compromised-by-random-goons personal phone. I love this arrangement and request it of every company that wants to get ahold of me in an official way "off the clock."

When I look over at my personal phone, I start questioning whether I need the latest iPhone Pro model. There are almost zero days where I'm doing anything more strenuous than watching content (YouTube, Netflix, etc). Most of what I do is listen to music, navigate with mapping software, and... browse the internet. Oh! And use Signal.

So... what then do I need a super-powerful phone in my personal life for? I don't find myself carrying around a dock like with Samsung Dex. I have used a bluetooth keyboard and mouse with my phone for international travel, but if I'm traveling for work, that's what the laptop is for. Typing on-phone while I'm flying somewhere is a limited luxury and it still requires an internet connection.

Let's lay out the requirements then, and see what I really need out of a phone, rather than guessing:

Personal Phone Needs

I'll organize the following lists in terms of frequency / importance for "normal" daily functioning.

Must Have

Conversation (Text, Video calls, Audio calls)
Email
Maps (Navigation, searching for locations)
Photo management (iCloud, Google Drive, self-hosted?)
VPN

Pretty Important

Fi-collar (Dogs)
Garmin Sync (Watch)
Home Management (Eero Wifi, Smart Lights, iRobot, etc)
Banking/Credit card stuff
Wallet functions (Apple Card? Tap to pay? Insurance, airplane passes, etc)

Nice to have

Socials (Mastodon, Bluesky, Discord, etc)
Streaming (Netflix, Paramount, Prime Video...)
Car Apps (MyChevrolet, MySubaru, Triumph connectivity)

Now that we have the list, let's consider what the options are.

Conversation: Signal does it all
Email: thank goodness for standardized imap clients.
Maps: Magic Earth (https://www.magicearth.com/) seems to be Murena's third-party solution to the "Private Maps" issue. ** See notes below.
Photos: (TBD self-hosted option) Considering internet-advice of https://syncthing.net/
VPN: https://mullvad.net/ - enough said.

That... actually seems to handle the core services I absolutely need on my phone. I can use the mobile web browser to do stuff like my banking and credit card apps, access socials, etc.

The big question remains about maps. Maps are now also a mix of business directories and locations. I could try using OsmAnd for community maps, but the issue is that those are always behind. Google and Apple at least have monetary incentives to keep their mapping software updated and active.

How much data do I have, really?

I currently have ~800GB of combined/family data stored in iCloud. Looking at the rest of my data, I've got quite a bit of music downloaded, so I'll have to see what that looks like.

...

About 14GB of downloaded music. That's not terrible.

iCloud drive says I have ~5GB of "files". Mail is around 200MB (and I already have that moved elsewhere). My personal photos are approximately 178GB... and Apple-Messages content is 26GB.

Looking finally at offline maps, I have about 3GB of the PNW saved, but that will absolutely vary when I need to travel to other locations.

This is starting to look... possible. Like, really possible.

Next steps

It looks like I have three things that I really need to replace: Photos, Maps, and Music.

To research:

Immich for photos
OsmAnd and Organic Maps for mapping/navigation and reviews
Music ownership options. Not streaming but actual downloads.

More blog construction

Mon, 29 Sep 2025 01:00:00 +0000

The blog continues to evolve in structure and design to make it easier to generate the blog (and maybe, finally, enable rss).

While trying to figure out what I need to generate RSS, I realized that I had a bunch of publish-date data inside the files (i.e., "Published: YYYY-MM-DD") that I would have to parse and extract to emit the files in correct order. Similarly, I had no way to generate the index page when I add new posts. I couldn't use filesystem access or creation metadata because these files move around a lot. I also couldn't use git data for the same reasons.

The fix was to move all posts into a Y/M/D path organization and treat the path as the sorting key for the index. Full details of my changes start at cdf2896 and end around 671a4d7. If I post more than once per day, I can add filenames like "1_foo", "2_bar", and so on to retain proper sorting.

Update: I, in fact, did this today. It works!
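The reason the path works as a sorting key: zero-padded YYYY/MM/DD strings sort chronologically under a plain byte-wise sort, so no date parsing is needed. A sketch under assumed names (the list_posts helper, a posts/ directory, and .md files are all hypothetical here):

```sh
#!/bin/sh
# list_posts DIR: print post paths newest-first, using the path as the sort key.
# LC_ALL=C forces a byte-wise sort so the ordering does not depend on locale.
list_posts() {
    find "$1" -type f -name '*.md' | LC_ALL=C sort -r
}

# Usage (hypothetical layout):
#   list_posts posts
# might print:
#   posts/2025/09/29/2_bar.md
#   posts/2025/09/29/1_foo.md
#   posts/2025/08/19/pandoc.md
```

One caveat with the "1_foo", "2_bar" prefixes: as plain strings, "10_x" sorts before "2_x", so a ten-post day would need zero-padding (01_, 02_, ...).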

Now that the files are organized by publish date, I'm investigating what it takes to reprocess the posts into a full-text rss feed.

More analysis and data to come.

Pandoc to the Rescue

Tue, 19 Aug 2025 01:00:00 +0000

Published 2025-08-20

When I set up this blog on BSD, I started out using cmark. I didn't want to invest in jekyll, hugo, or any of the other whole-static-site generators. I wanted to invest in small, focused tools, surrounded by shell scripts. For a short while I even entertained using bashblog, but I couldn't quite make it work the way I wanted.

I tried poking around with lowdown and thought it could help me build an rss feed but it turned out to be no better than cmark.

Finally, I realized pandoc was sitting right there. It would take in most formats I wanted and produce any number of results I wanted, so I decided to adopt it instead.

Pandoc produces very nice standalone HTML documents, but some of it was... missing something... when I'd manage the headers and footers manually, only relying on pandoc to handle the inner markdown. For instance, my block quotes were "only" indented -- they had no vertical bar or other indicator it was a separate reference. That made me suspect I was doing something wrong, especially when I could see nicer formatting on websites like this.

Turns out, I needed to build my own template. Pandoc keeps all of its default templates as internal representations, so you can't just edit a file and change the defaults. However, pandoc very helpfully allows you to emit those default templates for yourself:

pandoc --print-default-template html > html.template

Once I did that, I was able to add my own edits and modify my script accordingly. Now it looks much cleaner and has plenty of (useful) added style.

P.S. -- Tweaking templates in pandoc has me considering what it would take to generate an RSS feed from the data, with XSLT styling too.

Self-hosting is not hard

Thu, 14 Aug 2025 01:00:00 +0000

Published 2025-08-14

Drew Lyton wrote a blog post titled "The Future is NOT Self-Hosted" and while I can agree with some of what he says, his premise is flawed and the conclusion is off-the-mark.

Let's start with the argument's foundation.

Content is licensed, not owned, now

Companies like Amazon have been playing dirty with Digital Rights Management (DRM) since the Internet's inception. Purchased digital goods have always been licenses more than owned assets.

It's not hard to accept that between streaming, DRM, and (regional) lock-in for devices, companies and "license holders" have their entire business plan hinging on the fact that consumers do not own the product. "Subscriptions" makes Wall Street happy even though it doesn't make sense for most products and solutions.

No argument with Drew here.

Self-Hosting Definition

Self-hosting is when you have a computer in your house do those same things [as Cloud providers], but on a much smaller scale. You maintain the hardware. You set up the servers. You manage the applications. You store and backup your data. You troubleshoot issues when things (inevitably) go wrong.

If that sounds like a lot of work, it is.

This is where I start to call bullshit. Yes, the idea is that you run services that replicate the function of what you need.

The key word there is need.

Do you need a live-editing collaborative document system where you and N-number of other people are actively updating a single document? Probably not. It is neat to have two people concurrently editing a document but it is absolutely not required to have a useful, multi-user editing experience. If you really do need concurrently collaborative editing, there's always etherpad or multiple "Google Drive"-like options literally designed for self-hosting-scale.

The same could be said for having a "hosted streaming platform." Streaming of media is a matter of downloading data to the device but preventing the user from accessing or saving the already-downloaded copy of the data. When you use a client that permits turning your "stream" into a persistent file (like yt-dlp), all that "streaming" turns immediately into downloads you can replay or share to your heart's content.

If you shift your thinking such that your device will get a download of the data, then what you're looking for is a library or catalog manager, not a live streaming and/or transcoding platform.

(If this sounds familiar, that's because we have had good, polished, workable solutions to the above problems since the 2000s.)

So, what's on my home server?

This section just... feels designed to overwhelm the reader with unnecessary complexity.

That's kind of the point. All of that took 138 words to describe but took me the better part of two weeks to actually do.

Okay, bro. Way to make your point by sounding scary. You can also buy a Synology NAS (or this one) with plenty of CPU and storage to do the job you had to cobble machine parts together to do.

But in the end, I had four open-source alternatives to popular cloud-based apps running on my home server:

And now we get to the point: run a couple of applications on a network-storage-capable computer and you have a pretty good approximation of what you spend between zero-and-thirty USD a month to have from Apple/Google/Microsoft/Netflix/Paramount/etc.

To add to the evidence, here's what the author said on the ycombinator thread for his post (full thread here):

Yeah, at one point in writing this article I had a brief aside about more "off-the-shelf", accessible solutions to self-hosting like Synology. But I cut it because I honestly don't think they make the process that much easier. They help with hardware, but the software setup I think is still pretty difficult. Thanks for reading!

Your entire point is to bias folks against self-hosting so you can make your next point which is...

Right now, I'm self-hosting a private Google Photos alternative. It's fully owned and only accessible to me and my wife.

So, how do I create a shared photo album with my friends where we can all upload pictures from our latest trip?

[. . .] without exposing our services to the public internet and forcing our friends to signup for our weird app

You... expose the services to the internet and you set up accounts for your friends and family. This part is not hard. The solution is right there in front of you. Maybe you're hesitant because you serve your stuff over plain HTTP (as evidenced by your screenshots) but we have Let's Encrypt, so... why?

But okay, your point is that you want us to think sharing like this is hard...

His Vision

But just like the suburbs, this vision is incredibly inefficient and detrimental to creating vibrant, interconnected communities. It necessitates mass amounts of duplicate, unused infrastructure and requires each household to be individually responsible for maintaining that infrastructure. It silos us and makes it harder to share resources.

And what do we get in return? A worse experience than cloud-based services.

This is just exasperating. You're not "duplicating infrastructure," you're establishing infrastructure for you and those who you can support. You have the capability and capacity to set something up for you and your immediate family. What would happen if you offered hosting to your friends and broader family?

And, again, it's really not duplicating infrastructure to have a storage-box for your data. That's like saying "the county and the IRS have all my important records, so I'm not going to get a file cabinet (for my important documents)."

At some point, everyone is required to manage their irreplaceable document "stuff." Those with a permanent address will do things like buy a safe or filing cabinet and cart it around every time they move. That's the equivalent of the digital self-hosting solution. Nobody says they're "wasting space" or "duplicating infrastructure" -- it's seen as a smart investment to avoid catastrophic loss.

Those who don't have a permanent home address seek out alternative solutions:

Storing their important paperwork with family (parents, siblings, children, close relatives, etc) or friends.
Renting a safety deposit box from a bank (i.e. equivalent to the Public Cloud VPS)

These are well-known solutions to RV-ers and Van-lifers without a permanent address. We've had these solutions for decades and they work pretty well. What you have built right now is a really good "fire-proof safe"-equivalent as option-1 for your friends and family.

It empowers individuals, but it doesn't reject the power dynamic itself. It creates a system where everyone has to provide for themselves instead of a system that provides for everyone.

Comically missing the point.

But if we want to live in a world where we are not bent at the knee to corporate lords and also don't fall victim to the myth of self-reliance and rugged individualism, we need to think radically differently about how we create communal, shared internet infrastructure.

Uh... wow, you keep leaning into that false dilemma (perpetuated by corporate interests) that there are only two outcomes here... what's your angle?

Instead of building our own clouds, I want us to own the cloud. Keep all of the great parts about this feat of technical infrastructure, but put it in the hands of the people rather than corporations.

I'm talking publicly funded, accessible, at cost cloud-services.

. . .

The devil is in the details. But I know this world where we are all free from our corporate landlords through solidarity, mutual aid, and shared, community-owned, privacy-focused, internet infrastructure is possible.

. . .

"How do I build my own cloud?" is the question that inspired this journey. But, "How do we build a better cloud?" is the one that I'm left wondering as it comes to an end.

Comically missing the point, part 2.

You're literally employee number one of your future Co-Op, owning this infrastructure. You went far enough to mostly solve your own problem -- why not take the last step and start truly sharing?

Conclusions

Drew staked out a position on something that got everyone talking about it. If this is a false-flag operation, then, applause - what a success.

Ultimately, yes, I agree - I'd love to have municipalities and governments provide "public cloud" like infrastructure. I think that's a viable way forward for a lot of "bigger" issues that governments are currently paying large sums of money to private entities for.

That said, Drew specifically points out that his proposed communal system needs to work together with private systems:

Even with this technical solution, I think it makes sense to envision a world with private, market-provided options.

My friend, I'll say it again: that is you. Be the change you seek in the world. Provide services and leverage open protocols (like ActivityPub, IMAP/SMTP, RSS, etc) so that your customers aren't locked-in to your solutions. If you have the skills and the desire, you start up the co-op and build the solution.

Self-hosting is the start of this entire municipal / co-op owned infrastructure you're hoping for... you're just so focused on yourself and the false-dilemma of trying to provide "cloud like" solutions that you can't see it.

Anubis on git.gluecode.net

Tue, 24 Jun 2025 01:00:00 +0000

Posted 2025-06-23

TL;DR - Anubis is now protecting git.gluecode.net

I keep a couple of local mirrors of big, public repositories so I can do various git things locally on my server. Repos like mediawiki-core or the linux kernel or servo are fairly large and pulling them regularly as a full mirror means I don't have to worry about whether or not upstream is available.

Somehow AmazonBot, Claude, ChatGPT, and others found my server. They decided it was a good idea to click on every comparison link (repeatedly). They refused to honor robots.txt, so I put a 403-redirect for all "bot" user strings in nginx. That took out the "polite" bots that modified their user-agent string, like AmazonBot and ChatGPT, but there was a non-zero contingent of bots that did not care and proceeded to keep hitting my server.

Anubis saves the day

I did the somewhat-smarter thing back when I configured nginx and made cgit run within fcgiwrap, instead of the original apache/bozohttpd method of direct CGI invocation.

By running in fcgiwrap, I was already prepared to inject Anubis into the workflow. (Thank goodness for Xe caring about getting Anubis functioning on FreeBSD!) After following a few instructions to move my nginx-server directive for the cgit instance over to a socket-only listener, adding anubis was... easy. Nginx is both the TLS frontend and the backend for anubis to do its thing, so I get full protection for my computationally-expensive subdomain.

I've since "enhanced" the policy definition for anubis to be a bit more brutal with bots, but the answer to every bot-like client will be "Absolutely Not". I'll work on allow-listing groups like the Internet Archive, but they don't seem to play by any rules, and all of this code will be backed up separately anyway.

Thanks again to Xe. I made sure to increase my donations, as this is more than worth the money for an otherwise free product.

If you want to use this for commercial purposes, make sure to sponsor development. Your company is worth spending some money to make sure you survive the current/oncoming LLM crawler onslaught.

Systems Self-Defense Part 3

Thu, 05 Jun 2025 01:00:00 +0000

Published 2025-06-05

See Part 1 and Part 2

No, no, wait, hold on

So y'all may have been a bit surprised I brought up redis (or valkey) as a data-persistence option and then said "nah, we're not doing that."

"But wait," I hear you say. "Redis and Valkey persist their data, just like postgres and sqlite... why are you giving that up? They're fast, keep their data in-memory. What's not to love?"

True, as an SRE, I prefer to over-build things (to a point). I go for twisted pair wrapped in insulating plastic, not dry-core, paper wrapped lines. You might say "ah, that's overkill" but I like to plan for a little bit of future-proofing.

So what's wrong with valkey, or any other in-memory sharded solution? Durability and Scale, at precisely the point where you do not want scaling problems or questions about your persistence.

First steps

Let's look back at Part 1. In it, we had a single service instance that could only write as fast as the underlying filesystem (and operating system) could handle.

For a 7200rpm spindle drive, the disk can handle 50-100 IOPS. For an older, middle-rank SATA SSD (like a Samsung EVO 870), that hits around 100k IOPS. SATA supports upwards of 6 Gbits/s, but unless you've got over 10Gbit network interfaces (a step-up from most common consumer hardware, which caps out at 2.5 or 5Gbit), the service will hit a bottleneck at the network card before it hits the disk limit. Let's not even consider the (basic) NVMe drives that can easily hit 10x the SATA SSD IOPS.

Even from there, the operating system, the CPU, memory, or the service implementation in code can all provide hard limits to what can be handled at maximum speed and, ultimately, those limits are constrained to a single instance, a single machine.

Running a simple stress-test against an HTTP service, replying in-memory, we can get anywhere from 7k RPS to 40k RPS depending on how the implementation is optimized. That's about the limit of a single host -- we can do more than ~100k RPS with some clever tweaking, or reducing the amount of work, but let's consider that a single instance doing meaningful work will theoretically max-out at 100k RPS.

Once we start spawning multiple instances, across multiple machines, our limits are now (theoretically) unconstrained by a single disk or a single network adapter. That "100k RPS" can be multiplied by N number of additional instances. We're now limited by how quickly those instances can communicate with each other or some other, external system.

Durability with Valkey / Redis

This brings us to Redis/Valkey. Note, I'm going to refer to Valkey as "redis", but I do understand the distinction both in implementation and licensing. Let's set aside the licensing problem and look at the available functions.

You can read directly from the Redis team about benchmarking redis and pitfalls of the service here. If we review some of the risks/issues associated with redis, it says (emphasis mine):

The general indication you should use both persistence methods is if you want a degree of data safety comparable to what PostgreSQL can provide you.

This statement alone should make us all pause for a moment: if postgres is safer than redis, what is redis doing for persistence?

It has a point-in-time snapshot (RDB) that requires duplicating the instance and writing the dataset to disk, and an append-only-file (AOF) that writes the operations to disk as they are sent to the server. Both of these persist a recovery-style file to the filesystem. If your data is important, the combination of RDB and AOF is necessary to ensure complete recovery of your persisted data.

We have to ask ourselves, "is this data critical to our service?" Considering our only function is to accept a value and return a key that corresponds to the value, yes, this data is very critical to our service.

Durability with Postgres

Postgres may be looked at as "just another database", but it is well-known for being customizable, flexible, and highly reliable, unencumbered by licenses or restrictions on its use.

Postgres is a disk-first storage system -- while it will hold data temporarily in memory, its first priority is persisting the relevant bits to disk. This means that your maximum dataset size is not limited to how much memory your host has but instead is limited by the filesystem.

Due to the SQL-nature of its implementation, Postgres is also able to handle complex data objects in a way the other systems cannot. While that is not necessarily relevant to our current system (text -> key), allowing more complexity through adding fields/columns, ownership, access controls, etc, would require extensions in other environments. With any SQL-based relational database, such a request is usually "just another column/table."

That additional complexity often comes at a performance cost, but what we lose in performance is offset by multiple levels of validation and write-durability. Postgres prioritizes accuracy in a query over performance, such that a single INSERT (or even a SELECT) can trigger multiple writes to disk. Over time, Postgres will optimize files-in-memory to ensure speedy delivery of commonly requested data and, at any time, can provide analysis of where indexes or tables are consulted via the EXPLAIN command. Anecdotally, postgres handles 10TB+ databases with the same speed and efficiency as it would a 10GB database.

Durability in Redis

By comparison, redis with AOF writes every command to disk and can support some complex structures such as transactions. It even has a ft.explain directive, yet the operations are expected to occur entirely in memory.

Compared to disk, memory is both expensive and limited in size. The largest memory capacities (as of 2025) are 512GB or 1TB per chip, at approximately $15,000 USD per chip. Compare this to a refurbished 26TB hard drive that costs $290 USD (around $11/TB).
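As a rough sketch of that comparison, using only the prices quoted above and assuming the $15,000 figure buys the 1TB part (the function name is made up for illustration):

```python
def usd_per_tb(price_usd: float, capacity_tb: float) -> float:
    """Price per terabyte for a given part."""
    return price_usd / capacity_tb


disk = usd_per_tb(290, 26)       # refurbished 26TB hard drive
memory = usd_per_tb(15_000, 1)   # assumption: $15,000 buys 1TB of memory
ratio = memory / disk            # how many times more memory costs per TB

print(f"disk: ${disk:.2f}/TB, memory: ${memory:,.0f}/TB, ratio: {ratio:,.0f}x")
```

Under those assumptions, memory costs roughly three orders of magnitude more per terabyte than spinning disk.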

Let's also consider this separate from cost: should a dataset exceed the size of available system memory, not only will redis be unable to hold more data, it won't be able to persist said data to our AOF or RDB files.

Durability with redis, therefore, is limited to what memory your system supports and, per the discussion in Part 1, memory is a limited-availability resource.

Benching Valkey

If we stick to single-item GET and SET, redis can handle 120k RPS on a multi-core machine with gigabytes of memory allocated. The documentation also flags these as O(1) operations, indicating that they can be completed in constant time.

If we start using MSET and MGET, the throughput drops by nearly a third (to about 70% of the original throughput) and only gets worse as the set size increases. If we try and mitigate this by switching to the hash-oriented operations, we run into the same single- vs multiple-key drop in performance.

Testing Valkey locally on a relatively unloaded dual-xeon machine, the system supported ~135k RPS using SET and GET, but only up to 95k RPS (30% drop!) as soon as "only 10 keys" were done using MSET:

$ valkey-benchmark | grep -e "======" -e "throughput summary"
====== PING_INLINE ======
  throughput summary: 131406.05 requests per second
====== PING_MBULK ======
  throughput summary: 135135.14 requests per second
====== SET ======
  throughput summary: 135501.36 requests per second <--
====== GET ======
  throughput summary: 134589.50 requests per second <--
...
====== HSET ======
  throughput summary: 136239.78 requests per second <--
...
====== LPUSH (needed to benchmark LRANGE) ======
  throughput summary: 134408.59 requests per second
====== LRANGE_100 (first 100 elements) ======
  throughput summary: 73800.73 requests per second
====== LRANGE_300 (first 300 elements) ======
  throughput summary: 30646.64 requests per second
====== LRANGE_500 (first 500 elements) ======
  throughput summary: 18910.74 requests per second
====== LRANGE_600 (first 600 elements) ======
  throughput summary: 15933.72 requests per second
====== MSET (10 keys) ======
  throughput summary: 95969.28 requests per second <--
...
====== FCALL ======
  throughput summary: 136798.91 requests per second

You can also see a precipitous drop in throughput as the LRANGE operators grow in size. None of this is exactly a problem -- it tells us where the boundary lines are for our solution!

If redis can handle up to 135k RPS for small values, then so long as we limit our throughput to around 115k RPS (85% of max), we're great!

While 115k RPS seems like a lot, consider that you have to divide that up across all of your service instances. So if you have ten instances concurrently accessing the redis server, an even distribution would be: 115_000 / 10 = 11_500/instance

Your instances might only be able to sustain around 12_000 operations per second individually, if they're all working continuously.
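The budgeting above can be sketched as a couple of small helpers. The names are hypothetical and integer percentages are used to sidestep floating-point rounding; the exact figure for 85% of 135k is 114_750, which the text rounds to 115k.

```python
def shared_budget(max_rps: int, headroom_pct: int) -> int:
    """Throughput we allow ourselves, as a percentage of the measured maximum."""
    return max_rps * headroom_pct // 100


def per_instance(budget_rps: int, instances: int) -> int:
    """Even split of the shared budget across service instances."""
    return budget_rps // instances


exact = shared_budget(135_000, 85)     # 85% of the measured SET/GET rate
rounded_share = per_instance(115_000, 10)   # the rounded figure used in the text
exact_share = per_instance(exact, 10)
```

The takeaway is the same either way: every instance you add shrinks the slice each one gets from the single redis server.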

Benching Postgres

(baseline)
$ pgbench -c 10 -j 2 -t 10000
query mode: simple
number of clients: 10
number of threads: 2
maximum number of tries: 1
number of transactions per client: 10000
latency average = 5.271 ms
initial connection time = 16.112 ms
tps = 1897.185198 (without initial connection time)
--
$ pgbench -c 100 -j 12 -t 10000
...
number of clients: 100
latency average = 16.658 ms
tps = 6003.288229 (without initial connection time)
--
$ pgbench -c 1000 -j 12 -t 10000
...
number of clients: 1000
latency average = 171.768 ms
tps = 5821.795094 (without initial connection time)

Postgres looks like it can handle about 6k transactions* per second (TPS) across a multithreaded system. Anecdotally, I've seen systems hit upwards of 8k TPS, but let's assume a relatively unoptimized machine is our first attempt.

6k is definitely below 115k, but we're talking about a transaction-based unoptimized benchmark as well. Depending on how our dataset grows and what the EXPLAIN command tells us, we can add indices, make optimizations to how we query the data, or even add read replicas to offload read-only queries to a separate instance.

* Note: While we refer to "transactions" as opposed to "requests", this is due to Postgres handling many operations through a (potentially) multi-statement transaction, rather than the simplicity of a "GET/SET" statement in redis.

Boundary Conditions and Data Volumes

One thing we've hinted at but not developed is how much data our service is going to need. The way our design is constructed, we "accept text" but never defined how much text is "too much" -- what if someone sends us the entire body of War and Peace as a single text string?

While intermediary systems might reject that particular example before it ever gets to us, we should be in a position to set an upper-limit on the bytes we are handling per-request, per-transaction, with our service.

Let's say we want to accept a maximum length of 100 characters in a single request, and we'll return at most a 64-bit unsigned number (i.e., up to 18_446_744_073_709_551_615). Because UTF-8 encoding can use up to 4 bytes per "character", we need to account for 4 x 100 bytes (plus some padding for encoding, wraps, etc) for the text we are to receive. Because very few implementations will allocate "400 bytes" for an object, we should probably round-up our measurement to 1KB (1024 bytes) per received object. (This also lets us raise our acceptable length limit in the future, should we choose.)
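One way to enforce that limit, as a sketch: check the raw byte length first (cheap, and it rejects oversized bodies before any decoding), then check the decoded length. The function and constant names are hypothetical, and "character" here means a Unicode code point, which is an assumption about how the limit gets counted.

```python
MAX_CHARS = 100
MAX_BYTES = 4 * MAX_CHARS  # UTF-8 uses at most 4 bytes per code point


def accept_request(body: bytes) -> bool:
    """Return True if the request body fits within our size limits."""
    # Reject anything that cannot possibly be 100 characters or fewer.
    if len(body) > MAX_BYTES:
        return False
    try:
        text = body.decode("utf-8")
    except UnicodeDecodeError:
        # Not valid UTF-8, so not text we agreed to handle.
        return False
    # 400 bytes of ASCII would be 400 characters, so count those too.
    return len(text) <= MAX_CHARS
```

The entire body of War and Peace never makes it past the first check.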

If every line used the maximum size value, napkin math can give us some early "breaking points" for various solutions.

Redis is limited to memory and let's say the "average" server has at least 64GB of memory to dedicate to the process. If we save ~4GB for the operating system, that leaves us with 60GB to allocate. If we use ~1kb per entry, that leaves us with 62_914_560 or around 60M records before we start to exhaust redis' ability to store data. We can spend significantly more money and get a machine with 256GB of memory, but the added cost only buys us up to ~250M records (a 4x increase).
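The napkin math for the memory-bound case, sketched with binary units (the helper name is hypothetical):

```python
GIB = 1024 ** 3      # bytes in one GiB
ENTRY_BYTES = 1024   # the rounded-up 1KB per stored entry


def max_records(total_gib: int, reserved_gib: int, entry_bytes: int = ENTRY_BYTES) -> int:
    """How many fixed-size entries fit once the OS reservation is set aside."""
    return (total_gib - reserved_gib) * GIB // entry_bytes


small = max_records(64, 4)    # 64GB machine, ~4GB kept for the operating system
large = max_records(256, 4)   # 256GB machine, same reservation

print(f"{small:_} records on 64GB, {large:_} records on 256GB")
```

The first figure is the 62_914_560 quoted above; the second lands at roughly a quarter-billion, which is where the "~250M" comes from.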

After that, in order to scale, we need to talk about sharding the dataset and running multiple instances of redis, and coordinating their operations and backup.

For Postgres, as long as our operations per second don't exceed ~5500, we can persist as much data as our server-disks can hold. The Dell R750 rack server comes with a single, default disk size of 600GB. If we carve out 100GB for the operating system and other things, that leaves us with 500GB for the database, which provides us a minimum of 500M records before we start looking for additional storage options. Since we know this will be a database server, we can purchase six or seven third-party SAS drives for ~$14/TB and immediately add 100TB of usable storage.

(If you want to use AWS or some cloud provider to manage your machines, you can always use the EBS-equivalent and scale a single disk up to 64TiB, according to their documentation as of 2025. That gives you 63-billion records without changing anything about your database processor or memory.)

Conclusions

Ultimately, what we have in the choice of postgres, valkey/redis, or any other data storage solution, is a set of trade-offs. While redis may be extremely high-performance, it has a limited storage capacity without expensive scaling options. Postgres has significantly higher storage capacity, but is limited to much lower rates of execution.

Each one of the solutions can be scaled, but when it needs to scale matters just as much as how it can be scaled. This leads us to a set of directives we can apply to any storage or execution question:

If scaling requires more compute capacity, that is more expensive than adding storage capacity.
If we observe significantly more throughput demand than we do storage, lean in to the higher-performance option (redis).
If we observe low-throughput, long-term data storage, and/or need flexibility in our future storage systems, take the higher-durability option (postgres).

Systems Self-Defense Part 2

Mon, 02 Jun 2025 01:00:00 +0000

Published 2025-06-02

Part of an ongoing series. (See Part 3 here)

Okay, you've built your service, you've set in-flight requests to 90% of what you know it can handle (for a given amount of CPU, memory, and network). It's a great service.

And now it's time to integrate with another service!

Oh no

So your service accepts client requests, does useful work, and then replies to the client. And now you want to add a network-call to that procedure.

... Why?

Let's set aside that line of questioning for a moment and assume you MUST integrate with a non-local service/datastore/cache for $DAYJOB reasons.

Persistence

Our service is popular, so we've deployed it across multiple servers. $DAYJOB SREs have put a load balancer (like HAProxy) in front of our services to spread requests equally across them all. Because our service handles its own work and doesn't need things like TLS termination (almost always a bad idea) the SREs could do this without our involvement.

Now, $DAYJOB is worried the servers are doing repetitive work and they can't keep track of user patterns to show value to their investors.

Again, let's set aside the ickiness of the whole concept of tracking users -- it's gross and, depending how you do it, against the law in some jurisdictions -- but let's say persisting repeat requests is useful, and maybe users are asking to save some kind of operation history. Sure, yeah... the users are asking for it...

Caching?

If we're just looking to handle repeat requests, we could cache responses in-memory. This is a time-tested solution and often the first step in making algorithms faster if you find yourself doing repeat work.

As a service, we can do things like store the request parameters as keys to a cache, and set the response as a value. If the keys are the same, we just return the value right out of cache -- no additional computation necessary!
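A bare-bones sketch of that idea, with a plain dict as the cache. Here do_work is a hypothetical stand-in for whatever the service actually computes, and work_done only exists to show when real work happens.

```python
cache = {}       # request parameters -> response
work_done = []   # record of requests that needed real computation


def do_work(request: str) -> str:
    """Hypothetical stand-in for the service's expensive operation."""
    work_done.append(request)
    return request.upper()


def handle(request: str) -> str:
    """Return the cached response, computing it only on a cache miss."""
    if request not in cache:
        cache[request] = do_work(request)
    return cache[request]
```

Calling handle("foo") twice runs do_work once; the second call is answered straight from the dict.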

But what if the service restarts? We'll lose all that cached data!

There are many ways to store in-memory objects as a file, but if we're considering flexibility in our implementation we probably want to stick with a more portable solution than just splatting a cache object on disk.

There's the option of building a custom key=value format, or converting to JSON, but one of the fastest options is using sqlite. As developers, we get binary integration (it's a library you include with your application), portability with the application, and the flexibility of SQL and its associated data structures.

Data persistence

With sqlite, we can persist our cache data to a table (maybe with a table design like key, value). Because we're talking about SQL, we can have multiple tables to store any kind of data we want.
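A minimal sketch of that (key, value) table using Python's built-in sqlite3 module. The table name, function names, and the compute callback are assumptions for illustration, not the series' actual implementation.

```python
import sqlite3


def open_cache(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the cache database; pass a file path to persist it."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS cache (key TEXT PRIMARY KEY, value TEXT NOT NULL)"
    )
    return conn


def cached(conn: sqlite3.Connection, key: str, compute) -> str:
    """Return the stored value for key, computing and saving it on a miss."""
    row = conn.execute("SELECT value FROM cache WHERE key = ?", (key,)).fetchone()
    if row is not None:
        return row[0]
    value = compute(key)
    # "with conn" commits on success, so the entry survives a restart.
    with conn:
        conn.execute("INSERT INTO cache (key, value) VALUES (?, ?)", (key, value))
    return value
```

Pointing open_cache at a file instead of ":memory:" is what lets the cache outlive the process.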

Sqlite gives us fast and flexible data storage, but it is only usable in a single instance. There are some options to stream transactions across multiple instances, but sqlite has a single-writer paradigm.

Shared Data Persistence

If we want our service instances to do work "only once" across multiple instances, we can't use a shared file -- we need an external service.

As the joke goes, "now you have two problems."

Sqlite is often compared to postgresql since they're both free-and-open-source (unlike redis as others might suggest), so let's have our service work with postgres.

The Two Problems

In order to write data to postgres, postgres has to accept the request, (ideally) ensure the operation doesn't conflict with anything else it's doing, do the thing, and then respond to the client.

Doesn't this sound familiar?

In other words, now that we've decided we're going to have multiple instances talking to a database, we are dealing with two services.

Note: We won't go into transaction isolation and other SQL-specific parts here.

With two services talking to each other, our model looks like this (with some liberties taken around the query request):

(Step 1) Client  --   "foo"   -->  --> "Get/Insert: foo" --> [Postgres]
(Step 2) Client  --   "1"     <--  <-- "1"               <-- [Postgres]
(Step 3) Client  --   "foo"   -->  --> "Get/Insert: foo" --> [Postgres]
(Step 4) Client  --   "1"     <--  <-- "1"               <-- [Postgres]
(Step 5) Client  --   "bar"   -->  --> "Get/Insert: bar" --> [Postgres]
(Step 6) Client  --   "2"     <--  <-- "2"               <-- [Postgres]

Instead of having our service write and serialize to disk, we now "write and serialize" to Postgres, which means that if we try and write the same value more than once, Postgres just returns the original result. Work is saved at the cost of two services and we have services which can operate in parallel.
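To make the "Get/Insert" step concrete, here is a sketch that uses the built-in sqlite3 module purely as a stand-in so it runs without a database server. The table and function names are hypothetical. Against Postgres, the analogous statement would be INSERT ... ON CONFLICT DO NOTHING followed by a SELECT, issued through a client library.

```python
import sqlite3


def connect() -> sqlite3.Connection:
    """In-memory stand-in for the shared database (assumption: not real Postgres)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE entries (id INTEGER PRIMARY KEY, value TEXT NOT NULL UNIQUE)"
    )
    return conn


def get_or_insert(conn: sqlite3.Connection, value: str) -> int:
    """Insert value if it is new, then return its key either way."""
    with conn:
        # The UNIQUE constraint makes a repeat insert a no-op instead of an error.
        conn.execute("INSERT OR IGNORE INTO entries (value) VALUES (?)", (value,))
    row = conn.execute("SELECT id FROM entries WHERE value = ?", (value,)).fetchone()
    return row[0]


db = connect()
replies = [get_or_insert(db, text) for text in ("foo", "foo", "bar")]
```

The three calls mirror steps 1 through 6 in the diagram: "foo" gets a key, the repeat "foo" gets the same key back, and "bar" gets the next one.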

We're web scale! 🤮

Postgres backpressure

Postgres will give us feedback when it is under load. Queries can take longer to complete, connections to the database are of limited quantity, and we can run bad queries. For a single service or client talking to the database, this isn't a significant problem.

Let's say we have hundreds of service copies talking to our single Postgres server. This replicates our original client-service issue but now our service is also a client. We must manage our use of postgres while still providing our clients the mechanisms they expect, which means we get to talk about retries, the "thundering herd" problem, and "race conditions."

Retries

As an intermediate service, our processor needs to determine what to do when Postgres has a failure during a transaction.

In this example, we leverage transactions to ensure our writes to the database only meet the conditions we provide (namely, that the value doesn't already exist somewhere). If instance-1 and instance-2 of our service try to write "foo" to our database simultaneously, one instance will succeed, the other will fail. For the succeeding instance, the answer is easy: send the response to the client.

But what about the instance that failed?

Let's say instance-2 failed -- there are at least two options: we can retry at the instance-level, or we can pass the error back to the client.

Service handles retries

So we want the service to handle retries -- after all, the client already submitted the request. We accepted their request and so we feel responsible for making it happen.

(This seems to be a common feeling among many.)

If we follow this path, client response time is dependent on how quickly we can perform the "get or put" function described above.

In the case of a race condition, we'd see the client response time for a request that "lost" the race take two or three times longer than the one(s) which succeeded. As a more concrete example, if the successful operation takes ~50 milliseconds, a failed operation could take up to 150ms.
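To make the latency cost concrete, here is a rough sketch of service-side retries. The `WriteConflict` exception, the helper names, and the three-attempt limit are assumptions for illustration, not part of any real database driver.

```python
from typing import Callable, Tuple


class WriteConflict(Exception):
    """Hypothetical error raised when another instance won the race."""


def handle_with_retries(
    operation: Callable[[], int], max_attempts: int = 3
) -> Tuple[int, int]:
    """Run operation until it succeeds; return (result, attempts_used)."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation(), attempt
        except WriteConflict:
            if attempt == max_attempts:
                raise  # out of attempts: the client finally sees the error
    raise ValueError("max_attempts must be at least 1")


def worst_case_latency_ms(per_attempt_ms: float, max_attempts: int) -> float:
    """The client waits for every attempt the service makes on its behalf."""
    return per_attempt_ms * max_attempts


def make_flaky_operation(failures: int, result: int) -> Callable[[], int]:
    """Build a fake "get or put" that loses the race a fixed number of times."""
    remaining = {"count": failures}

    def operation() -> int:
        if remaining["count"] > 0:
            remaining["count"] -= 1
            raise WriteConflict("lost the race")
        return result

    return operation


if __name__ == "__main__":
    print(handle_with_retries(make_flaky_operation(failures=2, result=1)))
    print(worst_case_latency_ms(per_attempt_ms=50, max_attempts=3), "ms")
```

The client connection stays open for the whole loop, which is where the 50ms versus 150ms difference comes from.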

Client handles retries

Let's say that either a successful or unsuccessful response will be returned to the client within ~50 milliseconds. The successful operator gets a return value in their reply, while the "losers" of the race condition get an error code within the same time frame.

So, upon encountering the error, we pass back an error code to the client such as HTTP 409 - Conflict or 412 - Precondition failed. One could even "misuse" the status code and send a 429 - Too Many Requests to encourage the client to retry their operation.

I say "misuse" the 429 code because it's really not the case that the server couldn't handle more requests, but the client doesn't know that.

The client can either abandon its request or try it again. If it tries again, with the same content, it gets a successful return from the server (as the information is already present, rather than being newly written to the storage system). This is a rather simple implementation both from the server side and on the client side, however, it does "make the client do more work."
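A client-side retry can be sketched roughly as follows. The `send` callable and the fake server are hypothetical stand-ins for a real HTTP call, and the jittered exponential delay between attempts is my assumption rather than something the design above requires.

```python
import random
import time
from typing import Callable, Optional, Tuple

Response = Tuple[int, Optional[str]]
RETRYABLE = {409, 412, 429}  # the status codes discussed above


def submit_with_retry(
    send: Callable[[str], Response],
    payload: str,
    max_attempts: int = 5,
    base_delay_s: float = 0.05,
    sleep: Callable[[float], None] = time.sleep,
) -> Response:
    """Send payload, retrying on retryable codes; return the last response."""
    status, body = 0, None
    for attempt in range(max_attempts):
        status, body = send(payload)
        if status not in RETRYABLE:
            return status, body
        if attempt < max_attempts - 1:
            # Assumed policy: random delay so clients do not retry in lockstep.
            sleep(random.uniform(0, base_delay_s * (2 ** attempt)))
    return status, body


def make_fake_server() -> Callable[[str], Response]:
    """Hypothetical server: the first request for a value loses the race (409),
    as if a competing instance stored it; later requests find it (200)."""
    seen = {}

    def send(payload: str) -> Response:
        if payload not in seen:
            seen[payload] = str(len(seen) + 1)
            return 409, None
        return 200, seen[payload]

    return send


if __name__ == "__main__":
    print(submit_with_retry(make_fake_server(), "foo"))
```

The client decides how many attempts are worth making, and it can just as easily give up after the first error.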

Every time I bring up "let the client retry", people seem to get worried, like clients doing anything beyond asking for things exactly once is bad. I refer objectors to the MIT vs New Jersey school reference article and point out that if the client knows what's happening then they can make the choice for you instead of requiring the developer to handle every single retry / reproduce case.

Backpressure and the Thundering Herd of Requests

Whichever solution we chose above, we've defined a backpressure mechanism for our system -- the mechanism controls the inflow of information and prevents overload of any one component. Now to test it.

Let's say we become extraordinarily popular: our service has hundreds of instances, and multiple hundreds of active clients. This is great news! Popularity means we are useful and/or generating revenue -- good things for us as operators.

But then Something Happens and our database has a momentary outage. It comes right back online and can resume servicing requests, but our clients never stopped submitting data -- we're about to receive hundreds of concurrent, possibly duplicative, requests and Postgres might not be able to handle it.

If our service is handling retries, for each request to the service, we will:

Hold open a client connection
Make N number of requests to Postgres to fulfill the request

If our clients are handling retries, then for each request to the service, we will:

Make 1 request (possibly 2 depending on transaction limitations) to Postgres.

In the first case, our service amplifies every client request to the database, where the second does not -- each transaction is a fixed multiple (1x or 2x) of each client request. If a service amplifies request volumes, we're facing a situation where a denial of service (DoS) is achieved from within our service, with "normal" request traffic.
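The amplification is simple arithmetic. This sketch uses made-up numbers (300 concurrent client requests, up to 3 service-side attempts) purely for illustration:

```python
def database_requests(
    client_requests: int, attempts_per_request: int, queries_per_attempt: int = 1
) -> int:
    """Total queries sent to the database for a burst of client requests."""
    return client_requests * attempts_per_request * queries_per_attempt


if __name__ == "__main__":
    burst = 300  # hypothetical number of concurrent client requests
    print("service retries, up to 3 attempts:", database_requests(burst, 3))
    print("client retries, 1 attempt each:   ", database_requests(burst, 1))
```

When clients retry, each retry arrives as a new request, so the service gets another chance to reject it cheaply before touching the database.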

If we have a flood of requests coming from outside our service arrangement (i.e., from our clients directly), then we can either scale up our service and Postgres to handle more traffic, or we can decline to scale up and allow the excessive-use responses to push back on clients to handle retries. (And, if any one client is being abusive, we can talk about rejecting that one client instead of disrupting the whole service.)

Consequences

As shown above, when systems degrade or fail, the way the system handles loss of function directly affects both the system itself and the clients using the system. By making design decisions that limit the load on each component, we build our system for resilience in the face of degradations and failures.

Defaults are the only answer

Mon, 05 May 2025 01:00:00 +0000

Published 2025-05-05

This post might sound like a "hot take," but I promise it's not -- it's backed by over a decade of working across startups and some of the biggest engineering organizations on the planet.

TL;DR

The only solution to maintainability is the default, from upstream or otherwise.

Wat?

Whenever I get a new phone or laptop, I like to customize the system. I change the wallpaper, the colors, sometimes switch the theme to dark/light/custom. Then I start customizing the terminal/shell, adding a different font, usually using something like git to clone my dotfiles repo, and then I start adding packages for utilities I prefer over the posix shell.

All of this customization is contained to my personal system. It's not something you, the reader, have to deal with. It's not something anyone who's not hands-on-keyboard or remotely shelled-in to my system has to care about.

I want to emphasize this point:

The customizations only affect me

This is why defaults matter. From a platform engineering and SRE perspective, you may want to offer customizations for those "rare" exceptions to the default, but a supermajority (the higher, the better) of customers must be satisfied with the defaults.

This leads us to two axioms:

If a setting captures a supermajority of users but is not a default, it must be made a default.
If a default cannot capture a supermajority of use-cases, you must remove the concept of a default entirely, forcing each user to specify the setting each time.

Consequences

We now have two key points:

"Default": a setting suitable for most (60%+, 75%+, etc) use-cases in a cohort.
Absent a single-setting Default (above), specify nothing for the user.

This type of exercise is often done once and then forever left alone. When it comes to "release to the public" utilities and services, then that type of one-off decision making is acceptable, if only because trying to "fix it" with the global population of users can be impossible. (See OpenSSL and default ciphers versus libressl/boringssl.)

When corporations build their own internal platforms, utilities, and software for corporate consumption, one-off decision making is a terrible design decision which leads to unmitigable tech-debt, increasing development and maintenance costs across all sectors which use said tool.

Take an example of a certificate management platform. Let's say you have two runtime targets, A and B, and you can generate certificate types of X, Y, and Z. Among those six possible combination options, if a majority are A:X and B:Y, you're in a position to say "All runtime on A by default uses Certificate Type X and all runtime on B by default uses Certificate Type Y."

If, however, you're not able to provide a single solution for those environments, any kind of default will need to be constantly overridden by your users. If you provide a default that matches only 20% of environments, why is that the default and not "nothing"?
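The two axioms can be expressed as a small decision function. This is a sketch under assumptions: the function name is hypothetical, the 75% threshold is one of the example values mentioned above, and the observed settings are invented data.

```python
from collections import Counter
from typing import Hashable, Iterable, Optional


def choose_default(
    observed: Iterable[Hashable], threshold: float = 0.75
) -> Optional[Hashable]:
    """Return the setting that meets the supermajority threshold, else None.

    None means "ship no default": every user must specify the setting.
    """
    counts = Counter(observed)
    if not counts:
        return None
    value, count = counts.most_common(1)[0]
    total = sum(counts.values())
    return value if count / total >= threshold else None


if __name__ == "__main__":
    # Hypothetical certificate types observed per runtime target.
    runtime_a = ["X"] * 8 + ["Y"] * 2
    runtime_b = ["X"] * 2 + ["Y"] * 4 + ["Z"] * 4
    print("A:", choose_default(runtime_a))
    print("B:", choose_default(runtime_b))
```

Runtime A gets a default because one certificate type dominates; runtime B gets none because no single type does.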

Objections

It's always good to have a default.

Defaults make a decision that the user either has to accept or modify for their particular use case. If you don't choose a default, the user must make the decision each time.

Defaults save time only for those to whom the default applies. This ends up being a penalty for everyone to whom the default does not apply.

Why not have different things for different people?

Teams building solutions for a company need to make decisions that affect that solution for its entire lifetime. Defining a default means establishing an invariant in the code.

When a different team, or a different engineer, starts using the service or system, they're going to expect the unconfigured system to meet their needs. People don't immediately sit down and try to customize and configure a system they are entirely unfamiliar with; they will try the defaults first and then customize as they go.

Okay, what now?

You've got a service or a system with defaults already shipped. Well, if it's already in use, changing user habits is going to be nearly impossible.

If you have the authority to do so, suggest removing defaults in favor of "required options" or change the default to meet the true majority of users (whatever that aligns to).

If you are building a new service, please do not specify a default. If you do know exactly what your users want and can show it with data, feel free to establish a default.

Systems and Self-Defense

Wed, 16 Apr 2025 01:00:00 +0000

Published 2025-04-16

Despite what the first Avengers movie would tell us, a system can protect itself from itself if built intentionally. To understand how, let's start with the following:

Tyler’s Law:

“Any system will inevitably be used to 100% of its authorized capacity.”

Tyler's Corollary:

"If your authorized capacity is equal to your available capacity, your system will fail."

"Authorized" capacity and "available" capacity are different terms, and they are not interchangeable.

Denial of Service (DoS)

Authorized capacity for a computer is not generally controllable by the end user (unless you've got root access), so a single process can consume as much in resources as the computer has available (with very few limits). If one runs a command to duplicate a movie file of "Plan 9 from Outer Space" fifty thousand times, like

seq 50000 | xargs -I{} cp plan9.mov plan9copy{}.mov

the computer will dutifully use all its resources to accomplish that objective until the job completes or the disk is full.

Note, there is no interactivity once executed -- the command accepts one set of instructions, stops taking input, then executes without indications of progress until the job is done.

A service, on the other hand, accepts input from another source and persists. As the service does work on a job, it may be coded to accept more inputs, and reply to the requestor with already-completed work.

This poses a problem for managing resources: how many resources are to be used for in-flight operations? Does the service have enough resources to accept new work while processing a current job? How does the service tell the requestor it's not ready yet?

If the system is using all of its available resources to do work, then there is no resource left to respond to a client/user, to process added signals (like kill/term), or even provide telemetry to an observer.

If a service or a computer "goes silent," how are we sure it is functioning correctly, if at all?

The Problem

The key point to the previous section is this:

If we are able to give a service enough work that it uses all of its available resources, then we've achieved a Denial of Service condition.

This is bad. To allow a program to "cancel" an erroneous command or be triggered to produce telemetry / feedback, the program must be able to listen for signals from the operating system and act accordingly. The only exception is SIGKILL, which cannot be blocked or handled.

The goal of the operating system / kernel is to ensure that "authorized resources" never exceed "available resources", or the system will crash. This is why SIGKILL is unblockable -- it's an action of last resort by the operating system (OS) to protect itself.

But what if we have an interactive service? We don't want to terminate the process if it gets stuck -- we want it to keep running. So how do we protect it?

Self-defense

All programs and services practice a form of self-defense known as response-codes or error-codes. They provide signals to an operator or requestor that vary from "I'm still here" to "Please try again later" or "this broke something" or even "your request is broken and I won't do it."

What does this look like in practice though?

Service Example

Let's say I have a web server that accepts text, appends it to a file, and returns a line number to the client.

(Step 1) ClientA  --  "foo"  -->  --> [write to disk]
(Step 2) ClientA <--   "1"   --

Because this service has to write values sequentially, while it is performing work for Client-A, it cannot do anything else. This means that if Client-A gives the service enough work, no other client gets served.

So let's look at what happens when we introduce Client-B:

(Step 1) ClientA  --   "foo"   -->  --> [write to disk]
(Step 2) ClientB  --   "bar"   -->  --> [BLOCKED]
(Step 3) ClientB <-- "TIMEOUT" --  
(Step 4) ClientA <--    "1"    --   --  [write completes]

Because this service has to write values sequentially, while it is performing work for Client-A, if Client-B wants to send "bar" to the service, we must tell Client-B to wait or come back later.

By default, a client opens a TCP connection to a service, sends data, and then receives a response. The operating system queues incoming connection requests on the service's behalf, so if the service is busy with one connection, a new connection simply waits in the kernel's queue until the service is ready to accept it.

Your application, however, can't see this "wait" condition -- it just goes silent until either the kernel accepts your connection, or you time-out and the kernel evicts you from the queue.

This is not a great solution. What could we do instead?

Do Nothing?

No really, what if we do nothing?

That's a pretty great situation for the developer -- zero work needed and the kernel/OS does the multiplexing. This, however, is a horrible experience for clients and service operators. Before specific solutions like running a dedicated HTTP service or providing an http stack in-process, the answer to making a service network-available was using inetd. It was (and still is) incredibly slow and does not scale beyond very low traffic rates.

So, in doing nothing, the service appears inconsistent, with periodically high latency, and does not actually fix the problem (see "Denial of Service" above).

Application Layer Defense

Most network services operate on HTTP, even ones that bind to unix sockets. Even gRPC operates on HTTP/2, so I think it's safe to say I can leverage HTTP status codes as an example of how to respond to a client without requiring too much translation to other stacks.

When an HTTP server is "busy", RFC 6585 suggests responding with a code 429 which maps to "Too Many Requests."

Of note is this paragraph:

Note that this specification does not define how the origin server identifies the user, nor how it counts requests. For example, an origin server that is limiting request rates can do so based upon counts of requests on a per-resource basis, across the entire server, or even among a set of servers. Likewise, it might identify the user by its authentication credentials, or a stateful cookie.

Let's refer back to our Server example. If the server is handling a request from Client-A, then any additional requests that arrive while Client-A's request is being processed would be "too many requests" for the server to handle.

So what we should do is configure the server to do two things at once (or, at least, two things concurrently). As pseudocode:

var locked bool // true while a write is in progress

fn writeLineToFile(s string, f file) -> (success bool, l int) {
  locked = true;            // hold the lock only for the write itself
  n, err = os.Write(s, f);  // n is the line number that was written
  locked = false;

  if err != nil {
    return false, 0; // We failed
  }
  return true, n;
}

main() {
  f = open("filename")
  requests = http.Listen(port)

  // requests are assumed to be handled concurrently
  for r in requests {
    if locked {
      r.reply(http-429) // Too Many Requests
    } else {
      ok, line = writeLineToFile(r.payload, f)
      if ok {
        r.reply(line)
      } else {
        r.reply(http-500) // Error
      }
    }
  }
}

Let's take a second to look at this -- the main code opens a file to persist the data and listens for requests on the HTTP port. All seems normal, until we get to the if locked section. If the lock is active, immediately stop what you're doing and reply with an HTTP-429. (For those using gRPC, you can swap "HTTP-429" out for metadata status code UNAVAILABLE(14).)

You'll notice that we haven't done any additional code checking -- no computation, no querying the file, just check the boolean value, then act.

This is, computationally speaking, very cheap to do. The benefit of the positioning also means that we short-circuit the operation before we even start down the computationally-expensive path of accepting the payload and doing anything with the file.

Looking further into the writeLineToFile function, you'll also see that we "lock" the file for the minimum part of the operation -- the part where we actually write to the file. We don't lock the error-checking portion of the code and we don't lock the reply to the client. This means that while the code is doing things-not-writing-to-files, we can go as fast and as concurrently as we want.
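For readers who want something runnable, here is one way the pseudocode above could look in Python. It is a sketch, not the original implementation: the `LineWriter` class and its method names are hypothetical, the HTTP layer is left out, and the line counter assumes the file starts empty. It also swaps the plain boolean for a non-blocking `threading.Lock`, because two threads could both read a boolean as "unlocked" before either one sets it.

```python
import os
import tempfile
import threading
from typing import Tuple


class LineWriter:
    """Appends lines to a file, refusing new work while a write is in flight."""

    def __init__(self, path: str) -> None:
        self._path = path
        self._lock = threading.Lock()
        self._lines = 0  # assumes the file starts empty

    def handle(self, payload: str) -> Tuple[int, str]:
        """Return (status, body): 200 + line number, 429 if busy, 500 on error."""
        # Cheap check first: if a write is in progress, reject immediately.
        if not self._lock.acquire(blocking=False):
            return 429, "Too Many Requests"
        try:
            with open(self._path, "a", encoding="utf-8") as f:
                f.write(payload + "\n")
            self._lines += 1
            line = self._lines
        except OSError:
            return 500, "Internal Server Error"
        finally:
            self._lock.release()  # runs on both the success and error paths
        return 200, str(line)


if __name__ == "__main__":
    demo_path = os.path.join(tempfile.mkdtemp(), "lines.txt")
    writer = LineWriter(demo_path)
    print(writer.handle("foo"))
    print(writer.handle("bar"))
```

As in the pseudocode, the lock covers only the write; building the reply happens after it is released.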

If we increase our activity to three clients (A,B,C), our new operation graph looks something like this:

(Step 1)  ClientA  --  "foo"   -->  --> [write to disk]
(Step 2)  ClientB  --  "bar"   -->  --> [BLOCKED]
(Step 3)  ClientC  --  "quux"  -->  --> [BLOCKED]
(Step 4)  ClientB <-- "E: 429" --  
(Step 5)  ClientC <-- "E: 429" --  
(Step 6)  ClientA <--   "1"    --   --  [write completes]
(Step 7)  ClientC  --  "quux"  -->  --> [write to disk]
(Step 8)  ClientB  --  "bar"   -->  --> [BLOCKED]
(Step 9)  ClientB <-- "E: 429" --  
(Step 10) ClientC <--   "2"    --   --  [write completes]

You'll notice at Step-7, ClientC tried again, faster than ClientB, and they won the race! As we reply to clients, they can decide when to try again, and since ClientA's request was fulfilled in Step-6 the lock was lifted, allowing ClientC to make a write.

Regardless of who wins the race to the next write, the service can only handle one active request at a time, and it protects itself from being forced to handle additional work by telling clients to try again.

"Authorized Capacity" != "Available Capacity"

The service described previously has an "Authorized Capacity" of "1 in-flight request." If the computer running this service was a single-core tiny computer, it might not have enough power to serve additional requests (even the HTTP-429 replies) but most computing devices will have either sufficient speed (allowing for concurrent processing) or additional cores to handle concurrent requests. As a result, authorized capacity is less than the total available capacity of the server.

When running an unmodified, uncontained process in a computer, the process is "authorized" to use nearly the entire set of resources available to the OS.

In a cloud environment, a process can run inside a virtual machine (VM) where the constraints are similar to a bare-metal computer, but (with few exceptions) are executing within even higher capacity bare-metal hardware. This means the VM is authorized to operate in a higher-availability environment.

If running in a container or bsd-jail, the constraints are set by the container manager (sometimes just a command line argument!), and the kernel then grants (and constrains) those resources to the process inside.

If a process exceeds its authorized limits in any of these environments, the supervisor process (OS, container manager, VM, etc) will forcibly end the offending process ("kill"/"terminate") and reclaim its resources. It doesn't matter whether or not additional resources exist that the process could use -- the supervisory system will act.

Conclusions

All of this leads to a few heuristics when defining the runtime environment for an application:

Determine the maximum resource usage for a given process, then add excess capacity to ensure the system has more than it needs by an appreciable margin. A heuristic is a minimum of 15-25% at lower values, and as little as 5% for larger environments (to avoid wasting significant capacity).
Stress-test / Load-test your application to its "maximum" throughput for the resources you want to use, then set your maximum in-flight to 90% of that value. This will ensure that your instance is always capable of processing the maximum in-flight within limits.
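The two heuristics above reduce to a little arithmetic. In this sketch the function names and the sample numbers (4 GiB peak usage, 200 requests measured under load) are illustrative assumptions:

```python
def provisioned_capacity(peak_usage: float, margin: float) -> float:
    """Peak usage plus headroom, e.g. margin=0.25 adds 25%."""
    return peak_usage * (1 + margin)


def max_in_flight(measured_max: int, percent: int = 90) -> int:
    """Cap in-flight work at a percentage of the load-tested maximum."""
    return measured_max * percent // 100


if __name__ == "__main__":
    # Hypothetical small service: 4 GiB peak memory, 25% headroom.
    print(provisioned_capacity(4.0, 0.25), "GiB")
    # Hypothetical load test: 200 concurrent requests before degradation.
    print(max_in_flight(200), "requests in flight")
```

Integer math is used for the in-flight cap so the result always rounds down rather than up.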

Both of these will ensure your authorized resources (for the application, your container, and your process) will never exceed your available capacity and keep your service functioning at maximum throughput in the face of overwhelming requests.

Check out Part 2 here

Consequences of Deviance

Tue, 18 Mar 2025 01:00:00 +0000

Published 2025-03-18

Wikipedia has a comprehensive definition of deviance:

Deviance or the sociology of deviance explores the actions or behaviors that violate social norms across formally enacted rules (e.g., crime) as well as informal violations of social norms (e.g., rejecting folkways and mores).

There's a lot of that happening, all the time, across the world. Everything, from something as small as someone being "rude" in a social situation, to someone violating the constitution of a country, counts as "deviance."

Where we have our key differences is in how we respond to it:

Although deviance may have a negative connotation, the violation of social norms is not always a negative action; positive deviation exists in some situations. Although a norm is violated, a behavior can still be classified as positive or acceptable.

Deviance is foundational to a good, well-functioning society. As Rise Against put it in their song lyrics, "You have to cross the line just to remember where it lays." If a law is written, or a rule proclaimed, and no one violates it, is the rule good? Does it do anything worthwhile for the society? If no one challenges a rule, does that make it a good rule? Is it useful?

No, any rule defined but unchallenged is an inconsequential rule. We need deviance at various levels to trigger our social immune-response to a violation. That response to violation of a rule is what defines the consequences of the rule and without consequences the rule doesn't matter. This is why people get upset when others hold them to account for their behavior - it's not the behavior that holds power, it's the response it elicits.

Similarly, if someone breaks a rule, and everyone says "yeah, good job" or just doesn't react to it... the rule is also inconsequential. The action-without-consequences means the norm doesn't apply any more.

So what?

Why does this matter? It's key to literally everything we've seen in the news, politics, and your local school/coffee/corporate-office discussions for the last forty years.

When people collectively decide "doing X is wrong" or "doing X is uncool" or "doing X isn't wrong any more", you're talking about modifying the social contract and moving the "bright lines" on what is considered "good behavior." This is completely normal and regular feedback mechanisms (reactions, opposition, enforcement, etc) are all signs of a well-oiled society.

Where it goes wrong is when a group (or even an individual) deviates from "normal" behavior and gets no feedback. As stated above, "no feedback" is "approval" when it comes to deviance. Or, in a more pithy way "silence is consent."

Again, so what?

One cannot ever be silent to deviations from norms. Reactions to such deviance depend on the severity of the violation.

Polite verbal response: "You probably shouldn't do that", "That's not polite."
Direct verbal response: "That is unacceptable", "Cut that out", "Have you no decency?"
Polite physical response: slapping someone, shoving them, as an attempt to "disrupt" their actions.
Direct physical responses: unambiguous violence, beatings, murder.

When people do not respond to the first or second level of "correction," violence is the only acceptable solution to violation of norms. We see this constantly in cases where the Talking Heads do and say things that are continuously violating norms, and yet, no one is able to give them corrections in real time. This is inherently degrading to our society and removes foundational social feedback from where it is most needed.

Conversely, when people do have the opportunity to give feedback and corrections in real time, you get situations like this (archive link). You can hear the cheers and boos as challenges are leveled at the representative. The representative also says "can you give me a chance to answer the question?" -- a clear call for the audience to adhere to Question-and-Answer norms, allowing the respondent to speak.

In other parts of the meeting (archive link), you can see attendees that "violate" the norms of behavior are being escorted out by law enforcement.

The problem with this situation is the power imbalance: attendees are individuals (just like the representative) and yet their behavior is policed with force (i.e., physical removal) without actually reaching level 3/4 of "corrective action" toward the representative. The representative could have given a verbal response, but we never see whether that was effective. And, if the representative is wrong, they're still in a position to enforce their position against the corrective actions of an attendee.

Why We Fight

This is why real world, physical fighting is so effective. We can call physical-corrective-actions "riots" or other names to try and delegitimize the action, but ultimately it is the strongest legitimate corrective action available to a people when norms and rules are being violated.

So, when people tell you "violence is never the answer"? They haven't ever been punched in the face.

Wealth, Capitalism, and The Economy

Mon, 03 Mar 2025 01:00:00 +0000

Published 2025-03-04

Warning: Spicy Take to follow

I've only been alive after Reagan took office and systematically began dismantling the Federal government, so my lived-experiences are already fairly biased before I can even start talking about how "things used to be." Others have explained this (archive link) at length, so I'll defer to them on the background.

What I think is missing from today's understanding is that modern conservatives aren't "conservative" anymore. The republicans -- the group that is currently in power -- are deeply weird, unserious people, who couldn't be bothered to "perform their own research" if their lives depended on it. (And yes, actually, their lives do depend on it.)

What they've done instead is adopt the stance of your stereotypical 15-year-old American boy:

Remove all measures of transparency and openness
Claim as much authority and unilateral power as possible
Be deliberately, absurdly reductive and obtuse about things you don't like
Argue for nuance and empathy for things you do like, especially yourself
Act without regard for feedback, opinions, or justification

This isn't conservative or liberal. This isn't democrat or republican. This is willingly-ignorant, stupid even, versus... everyone and everything else.

How did we get here? Again, see the background link above, but I would like to take it one step further:

Every illiberal, non-progressive government and political entity is the cause of economic, bodily, and social harm.

You can't be a conservative and "do good" because the conservative ethos goes against the definition of "doing good things for people." To borrow a quote from someone else, the response to the Daily Wire and all these provocateur mouthpieces should have been:

"What is a woman?" Don't know and don't really care -- let's act in the way that causes the least suffering.

If that's not your stance, you aren't worth the skin you're in.

Switching to FreeBSD 14.2

Wed, 26 Feb 2025 01:00:00 +0000

Published 2025-02-26

Brief update from previous: the website is now being hosted on FreeBSD 14.2 (RELEASE) and Nginx (1.26+), instead of NetBSD 10.1 and Caddy (v2.9.1). Here's the dmesg

Why make such a change? I finally got to test the network and hardware performance of my system against a worthy opponent (thanks, Ben) and found the server lacking. FreeBSD has received a lot of Netflix contributions to directly improve x86_64 performance, so it stands to reason that FreeBSD will do better and be a bit more consistent.

Don't get me wrong, I love NetBSD and found it very clean, simple, and fully understandable, but I went off the deep end with packages and had too many ongoing performance quirks to not spend the time upgrading and fixing it.

Building "bulletproof" reliability

Wed, 15 Jan 2025 01:00:00 +0000

Published 2025-01-14

TL;DR

Automate your (personal) operations where you can. Don't rely on human memory or documents -- commit it to code and source control.
Always keep the resulting code up to date
Do not customize configuration unless such customization is expected for everyone who uses the tool. (e.g., SREs need a particular k8s cli config)
Rely exclusively on POSIX or tools you can test for (e.g., git) for maximum reliability.

Context

281 lines. All it took was 281 lines of code (329 if you include comments) to automate and make repeatable a process that originally took 1-2 days to complete each time it happened.

Provisioning and configuring your operating environment from a clean OS install is a very personal and custom experience for a person. You want to use your favorite tools, maybe set up your own custom keymaps (e.g., Capslock to Control), and make sure your favorite wallpaper or color scheme is available. Those parts are what make your experience "good" for you.

What the 281 lines of code did was not these things. The code took the most tedious parts of a 5-page-long new-hire onboarding document and automated the installation and configuration of required tooling. The script checks for existing installations and configurations and skips any work already done, so its execution is idempotent. It bootstraps the install environment (e.g. installing brew), installs required ancillary tools (like a kubernetes client configurator, git, java, etc.), ensures all $COMPANY internal configurations are present, and then starts executing additional package installations by using app-URLs to launch and install things through jamf and other managed platforms.

You still got to update your own tools, add your own aliases and shell scripts, but the basic bootstrapping needed for every SRE? "We took care of that for you."

Time and Effort Saved

When a new-hire joined the team, they used the script. It ensured they had the "basics" already available when they were asked to do a particular task. When a veteran team member had a broken configuration and was unable to do work, the script restored their "good working" tokens and configurations within a minute.

You might say that the time invested-and-saved is "worth it", or you might say that having to keep any of these documents or code up to date is "not worth it," and now you're stuck maintaining something you could just do on-demand before.

The problem is, this whole dichotomy of "[not] worth it" is completely untrue. Every company, every team, every individual, must operate and maintain their working environment.

You might argue "no, that's someone else's job." Fine, you've now externalized your needs. Doesn't change the fact that someone has to maintain it.

Maintenance

So if you have to maintain an environment, what's the most reliable option?

You could scribble down a document explaining how to install every tool. You could make every new-member work directly with a veteran and let the knowledge transfer happen orally and organically. You could have documents for every operation include "tools required," and leave it up to the executor to acquire them as-needed.

You could also write a $LANGUAGE program that digs deep and can provision your environment from nothing -- where the code is in full control. The problems with this approach include:

(Possibly) Learning a new-to-you language for implementation/bugfixes/etc.
Bootstrapping the language on a new platform (see: Rust, Clang, JVMs...)
Unique implementation details tied to language choice
A powerful programming environment may require otherwise optional libraries or functions your initial environment may not have

I argue none of these options are good, correct, or appropriate. The right solution is a POSIX shell script, because it's always there. You'll never be without a POSIX environment on Unix/Linux or MacOS. (Windows hates file-based configuration -- largely because they built APIs for mutation instead of file-oriented design.)

Writing a POSIX-exclusive shell script gets your environment and configuration set, regardless of how messed-up things might be.
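
A minimal sketch of the check-then-skip idea, assuming nothing beyond POSIX sh. The helper names (ensure_tool, ensure_line) are hypothetical illustrations and are not taken from the 281-line script described above.

```sh
#!/bin/sh
# Hypothetical helpers illustrating idempotent provisioning in POSIX sh.

# ensure_tool NAME INSTALL_COMMAND...
# Runs the install command only when NAME is not already on PATH.
ensure_tool() {
    tool="$1"
    shift
    if command -v "$tool" >/dev/null 2>&1; then
        echo "skip: $tool already installed"
        return 0
    fi
    echo "install: $tool"
    "$@"
}

# ensure_line FILE LINE
# Appends LINE to FILE only when that exact line is not already present.
ensure_line() {
    file="$1"
    line="$2"
    if [ -f "$file" ] && grep -q -x -F -- "$line" "$file"; then
        return 0
    fi
    printf '%s\n' "$line" >>"$file"
}
```

Usage could look like "ensure_tool git brew install git" or "ensure_line "$HOME/.profile" 'export EDITOR=vi'" (both examples are assumptions). Running either one twice leaves the machine in the same state as running it once, which is the property that makes the script safe to re-run on a broken environment.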

Promised Updates

Fri, 08 Nov 2024 01:00:00 +0000

Published 2024-11-08

As I had promised almost a full year ago, here are some updates to how I'm using my T7810 workstation.

A few facts before I get into the opinions:

I run NetBSD (10-Stable) on this host. Here's the dmesg.
I have three drives on the host: two internal SSDs (Samsung EVO 870, 1TB and 4TB capacity), one external platter drive (WDPassport, 5TB).
I have added a 10Gbit Intel-compatible X550-T1 network card.
I have added a small usb-powered 7-inch display for diagnostic / crash-cart purposes.

Now, with all that said, having 24 cores and 48 threads even on a 10-year-old chip has been more than fine. I don't need blazing single-threaded speed -- I need something where I can put a job and ensure that it will get done in a reasonable amount of time. That could mean a git gc --aggressive on a multi-GB repository. That could mean doing a transcoding job on a video I recorded in 4k and intend to post. Even rebuilding the NetBSD kernels and userland takes about 20 minutes from a standing-start. (If I'm building all the tooling from scratch as well, make it 40 minutes.)

Services active on this host include:

This website, backed by Caddy and PHP-FPM (for the RSS subdomain)
FreshRSS
mail sync (mbsync/isync)
Full cloud backups
IRC

I'm adding more self-hosted options as I get time and room to add things (like writing my blog publishing scripts).

To Do

For this host I am considering adding:

Archive In A Box
Samba / Bonjour / Media sharing to my home network
Mastodon hosting/node

Archive-in-a-box is to give me a "personal archive.org" so I don't lose websites or data if it gets rugpulled. I'm too worried about data getting lost for me to not spend the time and effort to capture what's available. On the plus side, I think the service lets me upload to the real archive.org as well, preserving an off-site backup.

Sharing on my network is just so I'm not beholden to Apple-iCloud or any other off-site service with my file-based data. I have plenty of storage I pay for and I can always add more at home for "less critical" temporary storage, like video files before they get reprocessed/cut.

Finally, might be fun to have a mastodon node/host on my own domain for myself and my friends. Not sure how much effort that will be.

November sucks

Wed, 06 Nov 2024 01:00:00 +0000

Published 2024-11-06

I'll just say it -- November, as a month, sucks approximately every 3-4 years. Doesn't matter who gets elected, we end up with "winners" and "losers" and a full half (sometimes more, sometimes less) of the US feels like the entire world is going to collapse.

Yes, our world is in dire peril. Yes, we have leaders who are making extremely short-sighted decisions that affect individuals' life-and-death stakes daily. When we do get bills and laws passed to make changes, the supreme court... makes incredibly opaque rulings based on very little established law and perspectives that make law professors cringe or, at the least, scratch their heads.

All of this matters. All of this is important. What I'm trying to remind myself of, is that we all cannot vote once every four years and expect things to Just Work. We must be politically active, all the time. We must constantly fight for the betterment of our jobs, our personal rights, our local, state, and federal representation -- all of it. It's exhausting, but it's the only way to make it work. Looking at leaders like Jennifer Briney, or Molly White, or the 5-4 podcast, I'm going to start spending more time reading and understanding laws-as-written, writing websites to track these sorts of things, and generally investing more in building awareness for myself (and, if I can, everyone else).

All that said, you might lose friends over your politics. You might have to re-evaluate who you work for. You may need to re-evaluate what party / parties you belong to, where you live, what roles you hold in your groups, all of it.

None of this is a Bad Thing. We are social creatures and, twisting the 1960s second-wave-feminism quote, "the political is personal." For all that people want to "keep politics out of it," when your existence and your health and your freedom are on the line, there cannot be a divide between the personal and the political.

Relevant quotes for inspiration:

Richard K. Morgan in the Altered Carbon series:

The personal, as everyone’s so fucking fond of saying, is political. So if some idiot politician, some power player, tries to execute policies that harm you or those you care about, take it personally. Get angry. The Machinery of Justice will not serve you here – it is slow and cold, and it is theirs, hardware and soft-. Only the little people suffer at the hands of Justice; the creatures of power slide from under it with a wink and a grin. If you want justice, you will have to claw it from them. Make it personal. Do as much damage as you can. Get your message across. That way, you stand a better chance of being taken seriously next time. Of being considered dangerous. And make no mistake about this: being taken seriously, being considered dangerous marks the difference - the only difference in their eyes - between players and little people. Players they will make deals with. Little people they liquidate. And time and again they cream your liquidation, your displacement, your torture and brutal execution with the ultimate insult that it’s just business, it’s politics, it’s the way of the world, it’s a tough life and that it’s nothing personal. Well, fuck them. Make it personal.

Face the facts. Then act on them. It’s the only mantra I know, the only doctrine I have to offer you, and it’s harder than you’d think, because I swear humans seem hardwired to do anything but. Face the facts. Don’t pray, don’t wish, don’t buy into centuries-old dogma and dead rhetoric. Don’t give in to your conditioning or your visions or your fucked-up sense of ... whatever. Face the facts. Then act.

Every previous revolutionary movement in human history has made the same basic mistake. They’ve all seen power as a static apparatus, as a structure. And it’s not. It’s a dynamic, a flow system with two possible tendencies. Power either accumulates, or it diffuses through the system. In most societies, it’s in accumulative mode, and most revolutionary movements are only really interested in reconstituting the accumulation in a new location. A genuine revolution has to reverse the flow. And no one ever does that, because they’re all too fucking scared of losing their conning tower moment in the historical process. If you tear down one agglutinative power dynamic and put another one in its place, you’ve changed nothing. You’re not going to solve any of that society’s problems, they’ll just reemerge at a new angle. You’ve got to set up the nanotech that will deal with the problems on its own. You’ve got to build the structures that allow for diffusion of power, not re-grouping. Accountability, democratic access, systems of constituted rights, education in the use of political infrastructure.

From Peter Watts' blog:

Edmund Burke once said that the only thing necessary for the triumph of evil is for good men to do nothing. I think that begs a question.

If you do nothing, what makes you any fucking good?

New Hardware

Wed, 29 Nov 2023 01:00:00 +0000

Published 2023-11-29

Intentions of "writing more" aside, I have ended up buying a few new pieces of hardware. A while back my 16GB M1 Macbook Air threw multiple "out of memory" errors in a week, so I traded it in for an M2 Max Macbook Pro with 96GB of RAM (the most it supports). Since then, the extra power of the laptop has made the coding projects where I use my i5 Coffee Lake Dell 3070 (a SFF box with 32G of RAM) seem pretty small. I've been reading up on buying a quarter-rack (12U) and looking into how expensive some Dell R710 and R6XX series 1U servers would be, but between the cost of the rack ($600+) and the noise of the 1U and 2U servers, it seemed I was going to be "stuck" with either using my laptop, living with an underpowered machine, or building my own out of new parts at significant cost.

Then I found this Dell T7810 refurbished on Amazon's Renewed store for $370. It has 2 x E5-2680v3 Xeon processors for a total of 24 cores and 48 threads, 128GB of ECC RAM (with eight slots total, up to 256GB for around $410), and comes with a 480GB SSD and an Nvidia Quadro card. Considering my main issue with the SFF machine was CPU power, I'd say this resolves many of my concerns around having "enough" crunching capacity on the side.

If I find myself in need of some added storage, the board supports PCI-E NVME drives, so I can throw in a 4TB disk and an adapter for ~$200 total.

It is scheduled to arrive between today and Friday (Dec 1), so I'll be putting up specs and notes after it gets booted.

Reboot

Fri, 13 May 2022 01:00:00 +0000

Published 2022-05-13

This journal stopped because I felt like I didn't need to write any more. I joined a new company and decided to wait before writing any technical entries. Then, I was granted a management position at my company and began to remove myself from technical contributions. Most of what I had to deal with was related to personnel matters and so could not be shared in a public journal anyway. By the time I "had the time" as a manager, I'd lost all interest in writing.

Now, I've moved to a new company. My job includes writing technical decision documents, architecting solutions, and giving opinions to guide engineering and business teams. So why not resurrect the journal?

I've got a few topics already queued and will be publishing in the coming weeks.

The Pace of Change

Sun, 22 Jul 2018 01:00:00 +0000

Published 2018-07-22

One of the first subjects my law professor discussed with the class is the nature of change. We, as a society, expect ever faster change in our world through additions like cars, cell phones, the internet, and ever continuing development of purpose and meaning around those things: privacy laws, seat-belt laws, regulation and testing, etc. This leads to the "superstructure" changing faster than the infrastructure.

The problem with this ever-faster pace of evolution is that the law, our norms, and our generational preferences are part of the infrastructure. While technology is unmoored from the limits of certain things like social convention, we still have people who will physically assault someone for wearing Google Glass or recording them via phone. While this may have been an issue as early as 2008/2009, we have states establishing two-party consent requirements as early as 2010, and by 2017 we have a constitutional right to record police. It took ten years for the law to catch up to the proliferation of portable, high-quality recording devices, and even then our societal norms haven't caught up. Fifty years ago, Future Shock shook the world with its predictions of radical change and evolving technology, which is also addressed in this NPR article.

As technology and our world evolves, much will change and we must change with it. One cannot stick their head in the sand, say "I disagree," and simply expect the world to adapt. Neither nature, expanding geopolitics, nor technology will care, but with time and effort maybe we can enshrine more cosmopolitan understanding and norms into our laws.

Google Follow-up

Fri, 16 Mar 2018 01:00:00 +0000

Published 2018-03-16

A lot of people with whom I've shared my journal entry were confused as to why I would rage-quit Google entirely when I was upset with only one piece of the company.

Normally I wouldn't. Normally, I would just pivot my usage over to another provider, stop using the offending service, and get on with life. What's different now is that rage-quitting is often the only way to get traction on much of anything these days. If I didn't rage-quit, nobody would respond, nobody would notice.

Google's divisions care about their bottom line and their publicity. By publicly saying they are terrible, I get the opportunity to get some kind of response from Google. That didn't happen here, but then again, Google doesn't really care about individuals unless they're huge influencers, in which case they'd also like to hire you...

The short of it is, we still have enough alternatives to choose a different option. Until we get to the point where companies aren't allowed to have such significant market power (again), companies will end up reaching a size where they don't care about their detractors and stop innovating. Hence Google.

I'll leave you with a quote. It's more political, but everything is getting political these days:

“The personal, as everyone’s so fucking fond of saying, is political. So if some idiot politician, some power player, tries to execute policies that harm you or those you care about, take it personally. Get angry. The Machinery of Justice will not serve you here – it is slow and cold, and it is theirs, hardware and soft-. Only the little people suffer at the hands of Justice; the creatures of power slide from under it with a wink and a grin. If you want justice, you will have to claw it from them. Make it personal. Do as much damage as you can. Get your message across. That way, you stand a better chance of being taken seriously next time. Of being considered dangerous. And make no mistake about this: being taken seriously, being considered dangerous marks the difference - the only difference in their eyes - between players and little people. Players they will make deals with. Little people they liquidate. And time and again they cream your liquidation, your displacement, your torture and brutal execution with the ultimate insult that it’s just business, it’s politics, it’s the way of the world, it’s a tough life and that it’s nothing personal. Well, fuck them. Make it personal.

-Quellcrist Falconer Things I Should Have Learned by Now, Volume II”

- Richard K. Morgan

Google is terrible

Tue, 27 Feb 2018 01:00:00 +0000

Published 2018-02-26

I'll admit it -- I've been a Google fanboy for a long while. They have really good web-interfaces, they work at providing consistent APIs for their products (I'm lookin' at you, AWS), and their products generally work as-advertised. They even had that motto: "Don't be evil"

My main complaint now is with their utter lack of customer service and support. I had heard anecdotes that they weren't the best about supporting their customers (even when paid for said support). I had experienced some of that "bad support" myself, but I dismissed those reports and my own mistreatment with the assertion that one doesn't need support with good documentation. Well, I was wrong.

Google is a phone service provider. They are registered with the other TelCos as a "prepaid" system, in that you pay ahead for each month of service. They also have a "device protection" plan to permit replacing a broken phone at a reduced cost, available within 30 days of device purchase.

The problem with Google is that, even when you trade in or purchase a new phone, you are still charged for device protection on the old phone. There is no notice that you will continue paying $5 per month ($60/year) on a service that you can't use, nor is there any option to transfer said coverage to your new phone, even if you've been paying on a different device up to this moment. Never mind the cost to the customer is the same (five US dollars), you've been paying on the wrong device, so you don't get device protection.

For what it's worth, I would have shrugged this off if it weren't for the Fi Support team being inconsistent in their own messaging. They can't even agree with themselves!

This is the last straw. Google was supposed to be better. Google was better ten years ago. Since then, they've turned into the worst parts of themselves and pursued a business venture that lags far behind companies that actually want to retain their customers.

Now, I'm migrating my customers and personal hosting to other providers like Azure, moving my phone plans to T-Mobile, and doing my dead-level best to get off Google's products in their entirety (including Gmail and Drive). In a professional context, I'm advocating for using cloud providers like Azure or AWS, mail from Outlook.com, and Office365.

Google, you've failed at being a company we can root for.

Docker Shenanigans

Fri, 03 Nov 2017 01:00:00 +0000

Published 2017-11-03

I've been working with docker since v1.6.0 was "new." I like docker. I think it has many good qualities. What I don't appreciate is how badly we developers, devops, and ops-persons abuse it.

Docker is not a VM

A virtual machine has virtual hardware, an internal operating system (which can be entirely different from the host), and runs the entire Operating System including its own memory management, software updates, user accounts, etc.

Docker containers, by contrast, share a kernel and otherwise act like a FreeBSD/Solaris jail for the process(es) inside, with a single user: root.

Docker leverages multiple technologies, acting as a frontend: kernel control groups (cgroups), Overlay (COW) Filesystems, and kernel namespaces. As such, the host configuration (e.g. kernel version) matters significantly more to a docker host than a VM host. This means Docker can pack more containers on a single host than most hypervisors can pack VMs, with dramatically less overhead per container.

Stop making oversized docker images

Docker containers are designed such that the application defined in the CMD or ENTRYPOINT directive is the init process for the container. The technology behind docker (above) means you need just enough to run the intended application. For most applications, that "just enough" can be as little as a single binary, or a dozen supporting libraries, or a JVM installation.

In order to make this "just enough" container, there's a specific method called the Builder Pattern that leverages container impermanence to build an application with minimal dependencies in the final product image. It's so powerful, you don't even need the language tools installed on your machine to build!

Make your production (deployment) containers as small as you reasonably can. It makes everything easier.

Stop trying to "hack" docker to make it go faster

Fix your architecture / code instead

Docker is not a panacea. It enables some very useful and secure optimizations for developers and operations/infrastructure teams alike. Its general-purpose design means it works "by default" for most architectures and applications as well.

If you've done everything to optimize the container, and moving an application into docker causes it to perform worse than when running natively, take a long look at the code and its architecture. Even traditional monoliths like mysql & postgresql perform well in containers.

Docker Builder Pattern

Thu, 02 Nov 2017 01:00:00 +0000

Published 2017-11-02

The Docker Builder Pattern is a highly useful pattern for leveraging docker containers to generate artifacts and then package those artifacts in a runtime-only image. For languages which produce binaries or single-data archives (e.g. Go, Java/JVM languages, Rust, etc), this pattern minimizes production container sizes, accelerates deployment, reduces incidents of broken dependencies and conflicting build libraries, and permits centralized control of build tools.

Note, this does not apply to interpreted languages such as Python, Ruby, or NodeJS, though similar results can be achieved in NodeJS with webpack and other preprocessing measures.

Examples below include both Windows and Linux/OSX equivalent commands.

Example Problem

Check out the code from https://github.com/tydavis/hello-world-docker and make sure you have Docker installed and running.

Build the Binary

If you don't have the Go compiler installed, don't worry! You're not going to need it.

Make sure your shell is in the hello-world-docker directory and execute the following command:

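For both Linux and Windows (PowerShell); in PowerShell, replace the trailing backslashes with backticks or enter the command on a single line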

docker run -v ${PWD}:/go/src/github.com/tydavis/hello-world-docker \
-w /go/src/github.com/tydavis/hello-world-docker -it golang:alpine \
/bin/sh -c "CGO_ENABLED=0 go build "

Let me walk you through this command:

"-v …" binds the current working directory to the relevant location inside the container (the part after the colon : )
"-w …" sets the container's current working directory to your mounted directory
"-it" grants you an interactive terminal connection
"golang:alpine" is the latest set of Go compiler and utilities built on top of Alpine Linux. Since we don't need extra utilities like the race detector or other glibc-exclusives, it's a safe choice.
/bin/sh -c "CGO_ENABLED=0 go build" -- this command disables cgo, which produces a statically-linked binary.

What did we actually do? We just created a Go binary without installing anything to our machine.

Build the Docker image

Now, with the binary created, it's just another file to docker. We can build our "production" image with:

docker build -t hello-world-docker:1 .

If you dig into the Dockerfile, you'll see that we start with the scratch layer. That means the only thing in this container is the binary. Let's look at the image size:

tydavis@utils:~/go/src/github.com/tydavis/hello-world-docker$ docker images
REPOSITORY           TAG                 IMAGE ID            CREATED         SIZE
hello-world-docker   1                   fc2081dda00b        10 seconds ago  2.03MB
golang               alpine              6e8378057093        7 days ago      269MB
tydavis@utils:~/go/src/github.com/tydavis/hello-world-docker$ du -h hello-world-docker
2.0M    hello-world-docker

What about Multi-Stage Builds

Multi-stage builds take the wrong approach.

Without using mount-points, users have been ADDing or COPYing their entire codebase into the container image via a docker build command, then using docker cp to extract the resulting artifacts. For languages which emit artifacts (binaries, or single archives like JARs), copying code into a container is fundamentally flawed.

The multi-stage concept takes this further down the "wrong" path, encouraging this same copy-code-into-image mindset and providing an unnecessary function to discard the image inline during the build process. As demonstrated above, one does not have to build or modify the tools/compiler container every time, meaning artifact-build-time is significantly faster than the multi-stage process, even with build layer caching.

Conclusion

If one is using a language that permits generating artifacts (Go[lang], Java/JVM languages, C/C++, etc) then copying code into the image will unnecessarily bloat the result. One should be using the builder pattern instead.

Conversely, if using something like Python, Ruby, or another interpreted language, then copying into the image may be the only solution due to runtime environment requirements.

Learning Rust

Tue, 24 Oct 2017 01:00:00 +0000

Published 2017-10-24

I'm in the process of learning Rust to build my systems-language skills. While at first my brain outright rejected the format and design of rust, I am now working my way through Rust By Example and the results are dramatically different. Finally, it's starting to make sense, and the benefits of the language design are becoming more apparent (especially from my perspective as an SRE).

I've been learning Go for the last few years, which makes Rust an immensely foreign design at first. Let? Mut? Traits? So much syntactic sugar… where are my goroutines?! Go was designed to fix long compile times, threading problems, and in general provide an "easy onboarding experience" for developers through simple design and having a lot of "batteries included" libraries. Rust, on the other hand, is trying to fix the problems with C/C++ as a systems language, focusing on code safety as a top concern.

I haven't yet built anything using Rust -- I still fall back to Go first -- but I do find ripgrep and fd are two incredibly helpful (and crazy-fast) programs worth installing everywhere. Once I start building my own solutions in Rust, I'll write a follow-up post with better insights.

Why clean up git branches

Thu, 07 Sep 2017 01:00:00 +0000

Published 2017-09-07

Problem

When encountering issues like:

Cloning repository...
Cloning into '/builds/username/repo_name'...
fatal: pack is corrupted (SHA1 mismatch)
fatal: index-pack failed
ERROR: Job failed: exit code 1

Solution

Run git gc --aggressive and/or git repack -a -d -f --depth=50 --window=250 on the origin repository. If you are using a system like Gitlab / GitlabCI, manually trigger the Housekeeping routine.

Background

Central repositories can suffer git index corruption and slow operations due to fragmentation and high numbers of objects in a git repository. By deleting merged branches, using only lightweight tags for references (instead of, say, release branches), and squashing merges into a single commit, we reduce the total number of objects git is required to manage.

Normally, git will automatically collect unused references, but the incremental GC operations don't regularly go back and rebuild or repack the repository data. These manual invocations of git gc/repack will resolve the issue.
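
One way to do the branch cleanup described above, sketched as two hypothetical shell helpers. The function names and the "main" base branch in the usage note are assumptions for illustration. The sketch relies on git for-each-ref --merged to list local branches already contained in the base branch, and on git branch -d, which refuses to delete anything that is not fully merged.

```sh
#!/bin/sh
# list_merged_branches BASE
# Prints local branches whose tips are reachable from BASE, excluding BASE itself.
list_merged_branches() {
    base="$1"
    git for-each-ref --merged="$base" --format='%(refname:short)' refs/heads/ |
        grep -v -x -F -- "$base"
}

# delete_merged_branches BASE
# Deletes each merged branch with the safe -d flag.
# Run this with BASE checked out: git will not delete the current branch.
delete_merged_branches() {
    list_merged_branches "$1" | while IFS= read -r branch; do
        git branch -d "$branch"
    done
}
```

Calling "list_merged_branches main" first gives a dry run of what "delete_merged_branches main" would remove. This only prunes local branches in the clone where it runs; the repack or housekeeping step on the origin repository is still needed afterwards.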

Ignorance costs us everything

Tue, 22 Aug 2017 01:00:00 +0000

Published 2017-08-22

There are a great many things I wish I had known when I was younger. What's worse is when I forget lessons I've already learned...

Manners cost you nothing but ignorance will cost you everything.

-- Ryan Swain

Being polite to others is easy when you prioritize it. Rudeness ruins me if I am not mentally present, when I let my brain stem take over my reactions, and this has hurt me ever since I thought I was free to act without thinking.

Somewhere around the time I went off to college, I thought I had "found my people," that I was able to relax and "just be myself." The truth is, I hadn't. I had found people who weren't willing to hold me to a higher standard of interaction, and expected me to relate at their level. So I did.

Years passed before reality was shoved in my face: my emotions, my unthinking actions, the inertia I was carrying from those college years forward were getting in my way. I was too concerned with what I thought was right, with "following the rules," and ignored the truth of customer service, which caused me to be passed over for a promotion. After wrestling with the source of my failure, I firmly sat on my ego and made my focus the customer - damn the rules. Not long after that I was rapidly, suddenly successful -- I had my promotion, with top marks from customers, and was regarded as an expert in my specialty. I would repeat this kind of mistake whenever I thought my main contributions were technical and not social.

I moved on to another company where they had significant technical challenges, but where I delivered was when I provided work which a customer needed badly. I'll never forget the day I reworked an otherwise neglected script and turned over the result to our account manager, with instructions on how the script could be adapted for other clients -- last I heard, it was in use two years later by every one of the top 20 clients.

Moving on to yet another company, I was thrown in the deep end with technology I had never worked with or seen. After six months of focused work on Kubernetes and our use-cases, I had nearly every technical aspect dialed-in and backed by research. Yet my role fell apart in relating to others.

I delivered some changes by fiat, argued emotionally, vilified any opposition to my plans, and refused to examine those across the table from me… These were not my proudest moments and I was at least partly aware of how bad the situation was. So, I took opportunities to attend Meetups, to deliver presentations, and share stories of our success, thinking that would force a better change in myself. It didn't, but I wouldn't be aware of that until my next job.

Today, I am in a position where my job is, unofficially, an experiment. This role has no power unto itself. I don't even work on the same Operations tickets as my peers, despite being an SRE in title. I'm basically Switzerland, mediating between and negotiating with two groups that sometimes treat each other aggressively, and sometimes don't even agree on the same use of language.

Here is where the social aspects are always apparent. Here is where my standout contributions are, again, social rather than technical. I cannot forget how important that is.

Businesses are inherently social. The solitary, asocial Creator-Person is a myth and the mythology is powerful in the tech sector. We have been misled by the mythology by perpetuating it, and I have forgotten, repeatedly, that the technical parts are the least of what we have to contribute.

I am never free to act without thinking. Manners, politeness, and a helping of tactical empathy make the impossible possible. I must remember the lessons of the past, think before I speak or act, and focus on what works.

It's all politics

Tue, 15 Aug 2017 01:00:00 +0000

Published 2017-08-15

It's really hard to watch the news unfold lately. I find the Charlottesville protests (is it too soon to call them hate groups/crimes?) absolutely abhorrent. I've talked about the right to privacy on the internet, but in cases like these, it's entirely different: your right to bigoted, racist beliefs ends when you pick up a sign, gather publicly, or otherwise impinge on the rights of others. In short, it's the Act that is censured, not the belief. And when it comes to speaking publicly, that includes Twitter/Facebook.

We tend to dismiss the harsh words of an outgroup because "they are not us," and rely more on what we see with our own ingroups. If we don't see someone in our group being an asshole, and our ingroup doesn't confirm it, we tend to not believe it. That impulse has served us well for most of our history, minimizing enemy propaganda efforts and promoting group cohesion, but this resistance to turning on one's own also gives some measure of protection to the deviants of any social group: as long as they don't deviate around their ingroup, they have more freedom to deviate among those outside it.

During my studies in sociology, we reviewed the case of two groups of teenagers in a small town: the skaters and the prep-boys. The skaters were seen as a nuisance and deviants by the townspeople because they would hang around parks, occasionally vandalize a business or part of their school, and would "exhibit deviant behavior" as teenagers often do. On the other hand, the prep-boys were seen as "good kids" by the townspeople, despite causing more damage and vandalism. What was the difference? The "good kids" had cars and so would drive to another town, isolating their vandalism to a place 30-plus minutes away, before returning home. There was no difference in the group members except that one had the means to travel and be relatively anonymous, while the other did not.

This ability to be anonymous is also not new in society, as one could always manage to travel when motivated, but temporary anonymity only came about during the Boomer generation with cheap cars and nationwide infrastructure. When considered together, participating in a protest hundreds or thousands of miles away seems like a relatively anonymous activity. One might think "what happens here stays here" i.e. nothing I do abroad will affect my life at home, so I can do what I want and not suffer the long term consequences. Thanks to the ubiquity of cellphones-with-cameras and the rapid propagation of information on social media networks, deviant behavior anywhere is no longer an anonymous activity.

Having lived in small(er) towns for most of my childhood and young-adult life, I know very well how disruptive, invasive, and anxiety-inducing it can be when everyone you see knows your business. You have to be constantly aware of the group mindset, policing your words and actions accordingly, or your livelihood is gone. What simultaneously scares and excites me is that this small-town mindset is now becoming a reality for the entire nation.

White supremacy and other antisocial, intolerable mindsets are being exposed because Trump and his administration have emboldened them. They now feel that they don't have to hide. That's not reality, and the fact that these people are being identified and fired is evidence of that. As powerful as this activity is, it's also the definition of "tyranny of the masses." In this case, I argue its actions are morally and socially just -- they are attacking those who are breaking the social contract of tolerance -- but when this mechanism doesn't have a ready-made target on which to focus, it can (and likely will) go devastatingly wrong. Unchecked, people who have merely different views will be subject to the same scrutiny and pressure as those who have completely antisocial views, and their lives will be ruined just as badly. A great deal of self-control will be needed and, thankfully, I see that already affecting subsequent identification activities. It seems the "masses" are only so vicious because of the subject, and it is awe-inspiring.

At best, this "rise of white supremacy" should be seen as an extinction burst, a final gasp as society purges the groups which actively violate the social contract before turning to a socially-enforced state of progressive liberalism. This is my hope and dream. The next question will be whether or not our elections will bear out a change of seats for our legislature in the midterms.

Update

Eugene Volokh wrote an article about the limitations of employers, and specifically how employees can't be fired for acting politically outside of work (in some states): Can private employers fire employees for going to a white supremacist rally?

For those who don't have a subscription, here's a PDF.

Your Ops team is dying

Thu, 03 Aug 2017 01:00:00 +0000

Published 2017-08-03

Operations teams which don't evolve are dying off, and that's a good thing.

Infrastructure is a matter of services, service providers, and your budget

Most companies don't buy actual hardware servers these days. Cloud Providers make it easy for services to leverage the provider's economies of scale and rent compute time and services. As such, your Operations team isn't building networks, hooking up servers, or doing anything except laying down the services on which your business will run. Also, teams will take advantage of managed services like on-event functions (Cloud Functions, AWS Lambda), or databases, meaning there will be less for your operations team to worry about.

Developers must be responsible for managing and deploying their code

Developers know their code, and how to scale it. Without a lot of documentation or runbooks, the Operations team knows little beyond "turn it off and on again" to resolve a 3am pager alert.

When you empower your Developers, they also take those responsibilities seriously. This article makes some really good points about this very issue, and how Operations can't necessarily handle the code-related bug in a service.

Focus on Simplicity and small teams for fast feature deployment

Operations usually scales with the number of Developers, but it's a matter of ratios: some shops have a 10:1 ratio of Developers to Operations personnel, while others reach 50:1 or higher.

I've personally worked on a team where the ratio was 60:1. After six months of work at (re)building the infrastructure, providing documentation, and campaigning for developers to be their own tier-1 on-call, I was moved to Tier-2 on-call. For the next year, I didn't get a single pager duty alert, despite being on-call for most of it.

If teams can be kept small and the architecture kept as simple as possible, shipping features and resolving bugs will be inherently faster than any large-scale organization structure. The (possibly apocryphal) Amazon "two pizza" limit applies here: if you can't feed the entire team with an order of two large pizzas (16 slices), the team is too big.

Recap

Operations teams need to focus on infrastructure and applications which let Developers create, deploy, and scale their designs without involvement from the Operations team. At best, the Operations team will be invisible -- no one interacts with them, the infrastructure doesn't fail, and the architecture is simple enough to understand and exploit without calling them in.

When the Operations team builds the right kind of infrastructure, you don't need a ratio of 10:1 -- you have a team small enough to be fed by one pizza or, ideally, no pizza. Your developer teams leverage provider-optimized services, keep the architecture simple, and understand the weak points of the architecture just like your Operations team did...

Post Script

For those who might think this is alarmist, or oversimplifying the role of an Operations team in an organization, let me clarify: this is my job. I've been working in an operations role for over a decade and seen radical efficiency and delivery (regardless of Product success) where companies follow this mindset. When companies have large-scale operations and IT organizations, that's when inefficiencies, delays, and the "ops vs devs" culture wars start. The future is having a group of Engineers, a focus on simplicity, and a problem to solve.

Google Sites

Tue, 01 Aug 2017 01:00:00 +0000

Published 2017-08-01

I've made so many changes to this blog/website that this should scarcely be a surprise for anyone any more.

While normally I would be serving off of my "naked domain" (gluecode.net) I instead elected to redirect the root of my blog to Google Sites and forego purchasing the GSuite for my domain. Reasons for this include:

Transferring data and information from tydavis@gmail.com would have been difficult or impossible (including Play Store, purchased storage space, and ~20GB of data migration).
Naked domain setup might be possible using basic domains.google.com records.

While I said I'd much rather "roll up my sleeves" and present exactly what I want, this isn't markedly worse than my original work. It adapts to mobile display without issue. I can embed maps, calendars, and youtube links at will. I can add Google objects (Docs, Slides, etc) and regardless it all reads cleanly on the screen.

I'm sure that I'll find something wrong with Sites in the future, but for now, I'm going to keep watching Gophercon videos like the one below.

Gophercon 2017 and other stuff

Mon, 24 Jul 2017 01:00:00 +0000

Published 2017-07-24

Keeping this post a bit short. Gophercon 2017 has released its videos on Youtube in a convenient playlist. I'm about 7 videos in and I have to say the talks are fantastic. Definitely worth watching.

Also, in the interest of keeping private things private, I have moved this blog and other repositories into my private Gitlab account. I'll be leaving only my most public and reasonably curated things on Github.

Finally, don't underestimate the power of taking a break. This last weekend my wife and I spent an overnight in Twin Harbors State Park, with easy access to the beach. While my back and sides have protested the mat on which I slept, my brain was on when I got back to the house. From that I've been able to write better, code more clearly and directly, and exert more willpower.

PADI Scuba Open Water certified

Mon, 10 Jul 2017 01:00:00 +0000

Published 2017-07-10

I finally got Open Water Certified with Seattle Scuba. The instructors and divemasters were amazing, and I highly recommend that people who love to explore and adventure give it a try. At the very least, you have a whole new respect for those divers you see in the movies!

For what it's worth, I've been bitten by the diving bug. All it took was the final dive. After doing all the skills and confirming I can handle myself through all the basic scenarios, we buddied-up and investigated the bottom of Cove 1 in West Seattle (just off to the left of Salty's on Alki beach). It's crab season, and there is so much down there to see...

After having a chat with my wife, we've agreed to get our Night Diver, Dry Suit, and Advanced Open Water training. It's just a matter of "when!"

Chromebook has arrived

Wed, 14 Jun 2017 01:00:00 +0000

Published 2017-06-14

With what I can only describe as an obscene amount of excitement, I've unwrapped my new Samsung Chromebook Plus and it's everything I could have asked for. Typing is fluid and feels like it will retain that crispness, the display is ridiculously bright, and even my weird "use the search key as a control key" setting has migrated over to this baby.

Oh yes, I'm writing this new post on the device. It's definitely in a position to become my daily-driver (minus the times when work requires that I use their hardware).

Some negatives at this time include:

The keyboard is ever so slightly cramped compared to the rMBP, but that's due to the 3:2 aspect ratio.
The screen is just so damned BRIGHT it burns my eyeballs a little
Chunky webpages like Inbox and Slack are taking just a few seconds to load and "get smooth," but admittedly this happens to my quad-core i7 laptop, and to my rMBP, so no real change there. (Ok, I was hoping for a little more out of this OP1 processor, but who am I kidding...)

Overall I believe I am and will remain extremely pleased with this purchase, especially once I finish my Coursera classes and can get to work on more development projects!

Final Chromebook Prep

Tue, 13 Jun 2017 01:00:00 +0000

Published 2017-06-13

In preparation for the arrival of my new chromebook I'm doing a final review and backup of my data. Not surprisingly, it's all been backed up and current for months.

The downside to this is that I need to sell my existing laptop and start removing my personal items from my work laptop. Some could argue that I shouldn't have had personal data on my work-provided laptop anyway, but the reality is that the lines between work and home have been blurred by companies and employees alike for decades.

That means it's time to sit down, take a long, deep dive into my machines, and see what needs to go where. I think the hardest part is going to be building up another server with utilities -- in my case, I need Libreoffice (or something like it) that can reliably read and reformat DOCX files into Word97-compatible DOC formats. Why? Because Google Drive appears to have lost its touted and much-valued Word-conversion abilities, and my online courses don't know how to build/write documents in a more compatible format.

Alright, enough yapping. Time to get to work.

Ranting about these Whole30 articles

Wed, 17 May 2017 01:00:00 +0000

Published 2017-05-17

I am sick and tired of articles like this one from Glamour.com.

"I did Whole30, but I really didn't because I half-assed the program and then completely gave up when my family visited. Now I've invented the Whole15!"

Ok, that doesn't quite convey the reality of the issue, so let me share with you her own words:

... If you half-ass it, they say, it doesn't count.

But I'm going to be honest with you now: I half-assed it here and there. Almond milk is very gross, and I quickly returned to soy as my preferred morning latte ingredient, bean ban be damned. I didn't ask at restaurants whether my piece of fish was cooked in butter. And if a lentil found its way into my veggie mix, who was I to remove it? I really did leave dairy, sugar (including my beloved Diet Coke), and gluten behind, but I refused to stick to a "healthy" eating plan that didn't permit rice cakes and peanut butter. If rice cakes and peanut butter are wrong, I do not wish to be right.

Let's get a few things straight:

Eating "approximately 1,000 Larabars at every meal" fails the spirit/intent of Whole30. That alone fails you out of the program.
Eating rice, peanut butter, or any amount of soy automatically fails the program.
Bottom line: You failed the Whole30 and you did not complete a Whole9 or even a Whole15.
You didn't "invent" the Whole15. That's been around for a while, and you didn't do that either.

Why am I emphasizing the fact that you failed, despite your admission of failure in your very first paragraph? Because you were still incorrect when you said you made it 15 days. Your program ended the second you put soy creamer in your coffee, or ate a buttered fish, or let a little lentil into your salad. When did those things happen? What was it, eight or nine days in? You never said so in the article.

When it comes to doing a Whole15, you plan to end your program at the appropriate time, then you follow through. Even Melissa Hartwig herself said you can complete shortened versions of whatever length you think appropriate:

"If you've done at least three full Whole30 programs, been working hard to maintain your new healthy habits, and just need a little reminder/pick me up/kick in the pants, a shortened version of the Whole30 may be just the ticket to help you accomplish some short-term health and habit goals." -Melissa Hartwig

Instead, you decided to rationalize the program failure and claim you did a "Whole 15." Bullshit. Anybody who "modifies" the program or decides they can have "just a little" isn't doing the program. There are no cheats, there are no exceptions, and you don't get to say you "did it halfway" if you violate those rules. You aren't allowed to claim you did anything remotely "Whole 30" if you change the program.

Elizabeth Logan, you and those like you have lied and are deliberately deceiving the rest of the world with your lie.

Stop it.

Blog on a diet

Tue, 16 May 2017 01:00:00 +0000

Published 2017-05-05

Every time I think I've found a suitable means of running my blog (e.g. Hugo, Ghost, etc), I end up reading or reviewing something else compelling enough to warrant implementation.

I'm an optimizer by trade and by desire -- I love efficiency and get a thrill out of playing the bicycle-riding hippie to everyone else's SUV-driver. When it comes to this blog, I've been looking for the simple, easy, and fast implementation that also delivers just what I want to me (and any incidental readers -- Thank you!). Since I started moving to a Chromebook, I've had to rule out local static site generators like Hugo and Pelican, and because I want to use my own domain with TLS certificates (particularly those from Let's Encrypt) and the latest HTTP2 technology, I have to run my own server with software like Caddy.

Practically everyone I know swears by Markdown format (which requires a parser/generator to create the real HTML) but it looked easy. Even the uncool Enterprise kids were doing it, and so I tried it. The hard part was that I knew HTML already and ended up googling syntax more than I wanted, and nothing ever came out 100% as I imagined it would in my head.

Fast-forward a couple of years. About a month ago, I read this talk by Maciej Cegłowski. This was it -- I had found my answer, and it was "roll up your sleeves and get to work."

Much like my use of Golang, Perl, and Linux, I'm a fan of anything you can produce or make better with mere effort. In this case, writing HTML was a little work, but it got me exactly what I wanted.

Life in Chrome (Part 2)

Tue, 04 Apr 2017 01:00:00 +0000

Published 2017-04-04

It has been over a month since I switched to using Chrome for all of my work and I think I've found my happy place.

Cloud9 is a fantastic editor and, by following C9's official directions, I've been able to set up the C9 IDE for both work and home. No, I am not giving c9.io access to my work machine. Yes, it is awesome to be able to call up a complete editor+terminal without a bunch of SSH keys. And the vim/emacs keybindings are spot-on.

I haven't resolved my DNS concerns yet, but HTTPSEverywhere and uBlock Origin are doing a bang-up job of keeping my connections secured and cutting out the ads. I think what I'll need to do for my home is set up a Raspberry Pi 3 for my home DNS instead. I've had some issues getting the C2000T to honor my settings and not add itself as the DNS resolver, but I have time on my side. If it comes to the worst situation, I'll buy my own gigabit-capable wired router.

Finally, after testing each of the streaming music services out there, I have settled on Google Play/YouTube. They are the only services that managed to work reliably as HTTPS-only. I'm aware that Google is an Ad company by itself, but there's only so much I can do about protecting myself there. I'm already a Project Fi subscriber, GMail user, and a Google Cloud Platform subscriber. They already have most of my data. My opposition is with CenturyLink deciding that they have a right to my data.

Why encrypting everything on the internet makes sense

Thu, 30 Mar 2017 01:00:00 +0000

Published 2017-03-30

TL;DR

Start using HTTPSEverywhere and support the websites that make a point of delivering your data in a way that protects you.

Since the House voted to destroy Privacy rules governing ISPs, I've been reading a lot of discussion on the subject and trying to find ways to keep me and mine under a nice veil of privacy. I have also encountered commentary about how streaming media (like Netflix) shouldn't be secured, because:

There are things which don't need encryption and movie streaming is one of [them]. We don't need the extra power wasted in our world as datacenters are power hungry monsters. Use encryption for what its designed for. Protecting confidential data.

In the end every Netflix user is going to pay the extra bill for this and this is a waste of resources in every possible way.

It's not. All of your activity should be encrypted in ways that cannot be decrypted or tracked and here's why:

Radio and other broadcast media are sent into the ether and broadcasters have no idea exactly who is listening. You can build your own crystal radio kit, listen to a station, and the broadcaster has no idea.

Contrast this with the Internet: every single device on the internet must request data in order to be sent data. Even fancy things like multicasting still require nodes to Join or Leave the network. Each system on the internet is in constant, identifiable communication with other computers in its network.

As such, due to the way Internet Service Providers (ISPs) work, they have the potential to completely track and control your communications unless they're 100% encrypted. The only way to end such invasive, dangerous, and wrong actions by ISPs and other entities is to use encryption from end-to-end.

I'll leave you with a quote from Bruce Schneier (emphasis mine):

Last week, revelation of yet another NSA surveillance effort against the American people has rekindled the privacy debate. Those in favor of these programs have trotted out the same rhetorical question we hear every time privacy advocates oppose ID checks, video cameras, massive databases, data mining, and other wholesale surveillance measures: "If you aren't doing anything wrong, what do you have to hide?"

Some clever answers: "If I'm not doing anything wrong, then you have no cause to watch me." "Because the government gets to define what's wrong, and they keep changing the definition." "Because you might do something wrong with my information." My problem with quips like these -- as right as they are -- is that they accept the premise that privacy is about hiding a wrong. It's not. Privacy is an inherent human right, and a requirement for maintaining the human condition with dignity and respect.

[ . . . ]

Watch someone long enough, and you'll find something to arrest -- or just blackmail -- with. Privacy is important because without it, surveillance information will be abused: to peep, to sell to marketers and to spy on political enemies -- whoever they happen to be at the time.

Privacy protects us from abuses by those in power, even if we're doing nothing wrong at the time of surveillance.

Ongoing encryption efforts

Thu, 23 Mar 2017 01:00:00 +0000

Published 2017-03-23

I have been using a hosts file override for years to cut the ads and "crap" out of my internet browsing experience. I recently discovered a better list and have been using it for a few weeks. I'm getting far more "you have an adblocker installed" warnings from websites I frequent, so it's definitely working better.
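For readers who haven't used one, a hosts-file blocklist is just a list of hostnames pinned to an unroutable or loopback address. The sketch below shows how such a list can be read; the function name and the sample hostnames are hypothetical, and it assumes the common "0.0.0.0 hostname" or "127.0.0.1 hostname" layout rather than any particular published list.

```python
def parse_hosts_blocklist(text: str) -> set:
    """Return the hostnames that a hosts-style blocklist sinks to nowhere."""
    blocked = set()
    for line in text.splitlines():
        # Drop trailing comments and surrounding whitespace.
        line = line.split("#", 1)[0].strip()
        if not line:
            continue
        address, *names = line.split()
        # Assumption: blocklists point unwanted hosts at one of these addresses.
        if address in ("0.0.0.0", "127.0.0.1"):
            blocked.update(
                name.lower() for name in names if name.lower() != "localhost"
            )
    return blocked
```

Entries mapped to any other address are treated as ordinary overrides and left out of the blocked set.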

Now the Senate has passed laws that permit ISPs to sell my data to advertisers and I'm ready to call Game Over [warning: explicit] on my internet access.

I'm looking at purchasing a Samsung Chromebook Plus but with the Chromebook comes the inability to install an encrypting DNS proxy and Google doesn't seem interested in supporting DnsCrypt with their public servers.

What they do provide is DNS over HTTPS which can be somewhat useful if we're running our own DNS servers at home, but by providing an HTTPS endpoint, Chrome and its extensions could bypass local DNS and make HTTPS-based DNS requests on their own in order to avoid being spoofed (or to hide additional DNS requests).
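To make the idea concrete, here is a minimal sketch of what a DNS-over-HTTPS lookup involves from the client side. It assumes Google's JSON endpoint at `https://dns.google/resolve` and its `Status`/`Answer` response fields; the function names are hypothetical, and the network fetch itself is left out so the example stays self-contained.

```python
import json
from urllib.parse import urlencode

# Assumption: Google's public JSON DNS-over-HTTPS endpoint.
DOH_ENDPOINT = "https://dns.google/resolve"


def build_doh_url(name: str, record_type: str = "A") -> str:
    """Build the HTTPS URL that stands in for a classic DNS query."""
    return DOH_ENDPOINT + "?" + urlencode({"name": name, "type": record_type})


def parse_doh_answer(body: str) -> list:
    """Extract the record data from a JSON response body."""
    reply = json.loads(body)
    if reply.get("Status") != 0:  # 0 is NOERROR; anything else is a failure
        return []
    return [answer["data"] for answer in reply.get("Answer", [])]
```

Fetching the URL with something like `urllib.request.urlopen(build_doh_url("example.com"))` would be an ordinary HTTPS request, which is exactly why an application could make it without ever touching the local resolver.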

Many things to research here.

Life in Chrome

Wed, 08 Mar 2017 01:00:00 +0000

Published 2017-03-08

My day-to-day computer activities are mostly through my work-provided laptop. Installing custom compilers or other untested software is expected for my role, and I've been granted administrative rights to my laptop, but we also have corporate antivirus and collective host configuration management. After a chat with the security team, I reminded myself that the security of our endpoints is more important, and deserves my attention.

Being firmly in the middle between Dev and Ops, I have a VM provisioned for my work requirements and it does the job. I don't actually need my local host's terminal to get my job done.

As an experiment, I created a cheap machine in Google Cloud Platform and set it up with my non-work environment, then wiped-and-restored my work laptop, only installing Chrome (and one piece of videoconference software we use) to my user's Applications folder.

Magic! Chrome can SSH to other hosts via an extension, we use Google Apps for Office purposes, and my VM at work took care of the rest. There's even a chrome extension that makes the GCP SSH window work better. Best of all, any "personal projects" I work on are isolated and run little-to-no risk of infecting my work environment and vice versa.

Now if I ever need to wipe the laptop or have it replaced, I can be back up and running in a matter of minutes. Given our upcoming security compliance audits, I can work within every security change because my host does nothing but act as a dumb terminal.

~~If it weren't for my need to have a self-contained laptop at home when the internet needs repair, I could~~ I can get away with working exclusively on a Chromebook, even when I break my router/access point.

Get on the Upgrade Treadmill

Tue, 07 Mar 2017 01:00:00 +0000

Published 2017-03-07

Always upgrade your software. Upgrading now means less pain later.

Recently my company upgraded from Apache Kafka v0.8 to the latest available version (v0.10.2). With it came wire protocol changes, the consolidation of two separate clients into one, offsets stored in Kafka itself rather than in Apache Zookeeper, and a dozen other changes that required adjustments in our golang services.

One solid outage-into-the-evening later, I've vowed two things:

Always Upgrade Your Software

If you use the latest libraries, the latest software, and constantly keep your systems up-to-date, the incremental changes end up becoming smaller and smaller as new point-versions come out. This gets easier as everyone learns to "ride the wave" and make systems smaller. Constant upgrades also suggest (though they do not require) a trend toward smaller services or microservice architectures in order to keep the upgrades small in scope.

Test Everything

If there isn't a test for a block of code, make one. If that code operates against a remote service, build a mock with expectations (or contracts) defined. If that's too much, and you have sufficient resources, run a stripped-down copy of the service locally. If you can't do that, at least consult the documentation and generate a mock out of supposed API docs.
Always generate your own test data. Don't expect someone else to make the test data for you.
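A small sketch of the "mock with expectations" idea, using Python's standard `unittest.mock`. The `QueueClient` class and `publish_event` function are hypothetical stand-ins for a remote-service client and the code under test; they are not taken from any real library.

```python
from unittest import mock


class QueueClient:
    """Hypothetical stand-in for a client that talks to a remote service."""

    def send(self, topic: str, payload: bytes) -> int:
        raise NotImplementedError("talks to a real service in production")


def publish_event(client: QueueClient, topic: str, event: str) -> int:
    """Code under test: encode the event and hand it to the client."""
    if not event:
        raise ValueError("refusing to publish an empty event")
    return client.send(topic, event.encode("utf-8"))


def run_publish_with_mock() -> int:
    # create_autospec copies the real method signatures (the "contract"),
    # so the mock rejects calls the real client would not accept.
    client = mock.create_autospec(QueueClient, instance=True)
    client.send.return_value = 42  # self-generated test data: a fake offset

    offset = publish_event(client, "orders", "created")

    # Expectation: exactly one call, carrying the encoded payload.
    client.send.assert_called_once_with("orders", b"created")
    return offset
```

The mock is built from the client's own signature, so if the client's interface changes during an upgrade, the test fails instead of silently passing against a stale fake.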

Shift back to blogger/ghost

Sun, 05 Mar 2017 01:00:00 +0000

Published 2017-03-05

I've been migrating this blog between various systems for years. I've used, Jekyll, most recently, and each time I've thought the new generator would give me something I needed or wanted.

In the end what really matters is the content, and I feel that in my pursuit of these technologies, I lost that.

I've migrated off of my old domain and, due to a lack of import options in Blogger, had to copy-paste my posts and formatting into the Blogger editor. For the record, the Blogger editor kinda sucks, but when the editor sucks, you just learn how to work with it -- practicing what I preach!

Update: This blog is now running through Ghost & Caddy, on one of my hosts in Google Cloud Platform, which provides automatic TLS and http2 support.

Code Less

Sat, 04 Mar 2017 01:00:00 +0000

Published 2017-03-04

There are a few core pillars of action and design which I've followed to great success over the years. They are as follows...

Always question your instinct to abstract

If you are in a SysAdmin/DevOps role, you need to constantly question the systems you are building. Question every single abstraction or proxy you design. Do you really need to store secrets in Vault, or can you use instance roles/tags/built-in functions of your provider? Do you really need Terraform, or Puppet, when you can accomplish the same with a single Ansible playbook YAML file? (Or better yet, let the application(s) do the configuration!)

This is a common refrain in development circles -- a lot of those I've worked with don't like frameworks and abstractions unless they're absolutely necessary. IT/DevOps/Ops/SysAdmins need to start adopting the same approaches, and when you do, you end up with a simpler and more reasonable system.

As an anecdote, I came across a program that wraps docker and docker-compose, injecting credentials as an added layer directly into the image build process, because developers aren't allowed to have non-dev environment credentials. The better solution is to have each environment inject the credentials into containers at runtime and remove any option to embed credentials into the image at all. In this way, the docker image remains the same across environments (facilitating troubleshooting and reliable testing), yet credentials never cross boundaries.
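On the application side, runtime injection can be as plain as reading the environment at startup. This is a minimal sketch, assuming the credentials arrive as environment variables; the variable names `DB_USER` and `DB_PASSWORD` and the function name are hypothetical.

```python
import os
from typing import Mapping


def load_db_credentials(env: Mapping[str, str] = os.environ) -> dict:
    """Read credentials from the runtime environment, never from the image."""
    required = ("DB_USER", "DB_PASSWORD")  # hypothetical variable names
    missing = [key for key in required if not env.get(key)]
    if missing:
        # Fail fast at startup rather than midway through a request.
        raise RuntimeError("missing credentials: " + ", ".join(missing))
    return {"user": env["DB_USER"], "password": env["DB_PASSWORD"]}
```

Each environment then supplies its own values when the container starts (for example with `docker run --env-file`), and the image itself stays byte-for-byte identical everywhere.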

Do everything in your power to remove abstractions

Do you really need a load-balancer in front of a proxy that communicates with an application? What if I told you you don't need a VPN? As a sysadmin, every extra piece of infrastructure in your environment is another point of failure. Every wrapper shell script, every time you use docker-compose instead of plain docker commands, every instance of make instead of a POSIX-compliant shell (ash) script ends up being one more barrier to a new-hire onboarding process, or one more hurdle for migrating to a new platform.

Prefer static programs to interpreted/dynamic systems

When possible, use systems that have no external dependencies, no local dependencies, and/or are statically-linked (at compile time). So, given C vs a Java/Python/Ruby program, prefer the C program. When possible, build Go (golang) programs as statically linked rather than relying on dynamic libraries.

By building static packages of software, one can guarantee (or have a high degree of confidence) that the program will still be useful when the system is in a broken state.

Anecdotally, I recently ruined the installation of on my CentOS 7 host because I performed a system-wide pip install --upgrade ...

Static Systems (Continued)

While we're talking about static binaries, let's also consider immutable servers. Building your environment so that you do not change things in-place, but instead build entirely new infrastructure to support the new deployment, while keeping the old one unchanged will sidestep a lot of issues with updating/upgrading any part of the environment.

Document your research and decisions

This is probably the most important pillar, because without it, everything you do is subject to decay. By writing down your experiences, decisions, difficulties and (especially) the business decisions surrounding the final implementation, you not only present a consistent and reliable narrative, but you have the ability to point out changes in circumstance when promoting a new solution to an old/existing problem.

When faced with a problem and you have potential solutions, narrate the problem, present the options, and send that to the stakeholders. If you "have no stakeholders," then You and everyone who may possibly come after you are the stakeholders.

Who is this guy

I'm a SysAdmin in an SRE/DevOps role. I've been in this kind of role for my entire career (10+ years), since before "devops" was a term. As such, I understand software engineering, and can communicate with developers "on their level," but my bread-and-butter is managing systems and infrastructure. I like writing code, and I love developing smart systems. What makes me happiest is simplifying, teaching, and automating myself out of work. I hope that these pillars give you a solid framework to build your own successful career.

Redefining Stability

Wed, 31 Aug 2016 01:00:00 +0000

Published 2016-08-31

Defining stability is a hard problem due to many factors. I propose the use of some guidelines and modifiers to describe various "stable" systems to better describe their true state.

Common Points

The software runs within its defined memory/cpu parameters
The software does not crash under "normal" load

Understable

There is some risk in using the service. Not all edge cases for the API have been tested/bugfixed
Deployments need monitoring by "that one person" who understands the service deeply
Subject to large or frequent code / API updates
Using "new" or untested libraries, repeatedly updated (subject to frequent deployments)

Stable

Most edge-cases have been fixed for the API
Engineering teams maintain and update the code regularly (as needed or monthly)
Relatively few updates to the business-logic code. The API is versioned and/or tested for backwards compatibility
Libraries are updated and the app is refactored to keep pace with security updates and Engineering progress outside the app (e.g. database version changes)

Overstable

So "rock solid" that no one wants to touch it -- it has "hairs" on it in the form of bugfixes and all edge-cases have been handled in code.
Code is reviewed / updated only when absolutely necessary to prevent catastrophic failures. Library and code updates are avoided.
Would be considered "abandonware" in other contexts
(Like understable) requires "that one person" who understands the code to maintain it

Conclusions and Reasoning

Both under- and over-stable systems are in a "bad" state and are something we should avoid.

Having truly Stable code means regularly reviewing or refactoring code to account for new Engineering requirements or practices, fixing or adding pieces as needed, and keeping the team's understanding of the code fresh.

Gitlab Follow-Up

Mon, 29 Aug 2016 01:00:00 +0000

Published 2016-08-29

The team which tested Gitlab for the duration of their sprint had such good feedback during and after the sprint that word-of-mouth prompted other teams to migrate ahead of schedule and start testing it for themselves. This brought up a few new data points and some lessons-learned.

NodeJS and Alpine

Running NodeJS 4.4.7 and npm 3.10.5, our particular installations somehow incur an approximately 70-second delay in npm install times when running on Alpine 3.4 instead of Debian Jessie. We haven't yet narrowed down where this slowdown occurs, but my feeling is we are "doing something wrong," as other NodeJS shops have reported significant gains when running on Alpine.

Also, Java and other languages work fine in Alpine. NodeJS is our only difficulty right now.

Cache server needs a Delete button

While this has more to do with the Minio project than anything, we've had some incidents where the cache zip archive was either corrupted and "killed" the build or grew so large (4GB+) that it doubled our build time. In the case of a corrupted zip, the build did not continue without the cache files; it stopped entirely with an error message.

To the Gitlab CI folks: this is not how you treat a cache!

The only fix we could come up with was to delete the cache zip and perform a fresh build, but this was not easily accomplished by the developers, especially since they weren't able to issue a delete command either from Gitlab or on the Minio web interface. Instead, we had to go in to the host itself and remove the cache file directly from the disk.
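To make the "this is not how you treat a cache" complaint concrete, here is a minimal sketch of the behavior I would have expected: treat the archive as optional, and fall back to a cold build when it is unusable. This is not Gitlab's or Minio's code; `restore_cache` and its arguments are hypothetical names, and it assumes the cache is a plain zip file on local disk.

```python
import zipfile
from pathlib import Path


def restore_cache(archive: Path, dest: Path) -> bool:
    """Unpack a cache archive into dest.

    Returns True if the cache was restored, False if the build should
    continue without it. A bad archive is deleted so the next build
    starts from a clean slate instead of failing again.
    """
    if not archive.exists():
        return False
    try:
        with zipfile.ZipFile(archive) as zf:
            # testzip() returns the name of the first corrupt member, or None.
            if zf.testzip() is not None:
                raise zipfile.BadZipFile("corrupt member in cache archive")
            zf.extractall(dest)
        return True
    except (zipfile.BadZipFile, OSError):
        # A cache is an optimization, not a dependency: drop it and build cold.
        archive.unlink(missing_ok=True)
        return False
```

The point of the design is the return value: a missing or corrupted cache costs build time, but it never stops the build.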

Versioning is an interesting discussion

For a very long time we have used semantic versioning at my company, and the developers wanted a way to tag builds when successful. The fact is that a runner's git checkouts are read-only, no arbitrary parameters can be passed to a build, and builds can happen in various orders (with various build numbers/iterations) decided by the runner based on available resources. Then there's the whole argument about how to enforce SemVer changes: people inadvertently break the major.minor.patch protocol all the time, and there is no way to correct or confirm it in the code.

As such, we had to take a long look at how we versioned our software and our artifacts. The first thing we noted is that we don't have a lot of shared libraries which require semantic versioning. Instead, we just need a way to determine a (fuzzy) magnitude of change and a means of sorting newer vs older artifacts.

(We do this all the time in Golang!) So with a bit of discussion we stumbled on a particularly easy and human-readable solution:

BUILD_VERSION=`git log --oneline | wc -l|tr -d ' '`-`git rev-parse HEAD|head -c 10`

This produces a version of the format <commit count>-<hash prefix>, satisfying the (again, fuzzy) magnitude of change and providing an easy way of determining newness (barring any incidents of force-push rebase operations). Version 4123-2f1df15079 is going to be newer than 4007-fc036c4ee0, sorts properly when using most sorts, and (due to the hash value) guarantees that if versions are produced independently (say by two different developers) they will not have the same version.
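For tooling that needs to compare these versions, a small sketch of parsing and ordering them follows. The names `BuildVersion` and `newest_first` are made up for illustration and are not part of our build system; the only assumption is the count-then-hash layout described above.

```python
from typing import NamedTuple


class BuildVersion(NamedTuple):
    commits: int  # number of commits reachable from HEAD
    commit: str   # leading characters of the HEAD hash

    @classmethod
    def parse(cls, text: str) -> "BuildVersion":
        # Split on the first "-" only; the hash itself never contains one.
        count, _, short_hash = text.partition("-")
        return cls(int(count), short_hash)

    def __str__(self) -> str:
        return f"{self.commits}-{self.commit}"


def newest_first(versions):
    """Order by commit count; the hash identifies a build, it does not rank it."""
    parsed = (BuildVersion.parse(v) for v in versions)
    return sorted(parsed, key=lambda v: v.commits, reverse=True)
```

Comparing the count as an integer matters once the number of digits changes: a plain string sort only agrees with this ordering while the counts have the same length, whereas a version-aware or numeric sort handles both cases.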

I have yet to get a lot of traction with this scheme, as the Cult of SemVer is strong, but for offices where the product is not a code library it doesn't make a lot of sense to use SemVer.

Gitlab and GitlabCI

Tue, 16 Aug 2016 01:00:00 +0000

Published 2016-08-16

My company currently uses Github and Jenkins to handle our source code and build our artifacts/docker images, respectively. We've had to adjust our build system over time as we grow, now running a total of three build agents with 8 cores per agent, and ~20GB of RAM per host.

As a result, we can have as many as 20 builds and/or tests running simultaneously and there is very little "rhyme or reason" to our build schedule save that they almost always happen between 9am and 5pm PST.

Seeing that we were spending $$$$ to run these always-on build-agents and store a limited number of repositories, I decided to look at what it would take to create a dynamic build system.

The Runners Up

Jenkins plugins for docker, kubernetes, etc
Travis CI (and Circle CI, Drone.io, ...) and other build-as-a-service platforms
TeamCity
Atlassian Bamboo

Jenkins plugins for docker-based scaling, including builds via Kubernetes, were all in their alpha state or otherwise unsuitable for use with our environment. The only other alternative with Jenkins would be for me to write my own plugin to do exactly what we wanted for dynamic builder creation.

TravisCI and such are great for free/open-source systems, but for a private company with our concurrent build requirements, they were a no-go. Too costly with not enough concurrency for our day-to-day operations.

Similarly, TeamCity and Atlassian Bamboo (the commercial offerings) made all their money around additional build agents. This would encourage us to keep our number of agents small, which is the same situation we were in with Jenkins.

Gitlab

Gitlab CE (Community Edition) came out ahead by having:

Git hosting for unlimited private repos and users (saving us the $450/month we were spending on Github private repositories)
A CI system where the configuration is stored in the repository itself (giving Developers control and revision history)
GitlabCI's configuration is in standard YAML (rather than a weird DSL like Jenkins)
GitlabCI allows for docker-machine operation, creating machines as needed to accomplish its build objectives, then tearing them down (re: deleting them from our cloud provider) when idle. Also all operations happen within docker containers, meaning provisioning operations are not required for the runners.

Needless to say, I was impressed with it.

I had to patch docker-machine due to some non-standard configuration issues we had in Google Compute Platform, but that only took a day to figure out. Once I had the patched docker-machine installed on the Gitlab host, I configured the runner to use docker-machine and launch a system with privileged containers. After all that, we were in business!

The configuration YAMLs for most of our services are less than 50 lines, so it didn't take long to get a few services migrated over.

Caveats and Issues

Using git to commit results back into the repository is not possible (by default) in a build. Runners have a read-only copy of the repository, but Artifacts can be saved to the master (and are available for download after the build). If you want to commit the results back in to the git repo, you'll have to create a docker image with credentials built in.

Since runners are created and destroyed regularly, docker images should be optimized for size (i.e. use Alpine linux as a base instead of Debian) and one should have a mirror if the docker registry isn't on-network. Reducing the transfer requirements will greatly accelerate builds.

Dogfooding and Impressions

One of the teams nearby is almost always willing to try the "new stuff" first and give feedback / input, so I have them using Gitlab for their current sprint. So far, they are happy with it and excited about the automatic testing and build operations.

Tech Choices and Pitfalls for DevOps

Wed, 16 Mar 2016 01:00:00 +0000

Published 2016-03-16

DevOps encompasses ideas which fly in the face of the last few decades of systems design. Similarly, adoption of cloud-based architectures also disrupts the last few decades of intuition and "common sense."

Below is a clarification of technologies: reasons for their adoption and usage, or why their avoidance is preferable. Interspersed are snippets regarding pitfalls of purely virtual (cloud) architectures and their usage.

Containers - Docker

Let's talk about docker. Actually, let's talk about containerization.

Containers are a godsend to engineering as a whole. With a container, all code, library dependencies, and other "must be set up" requirements are handled in a single, atomic, blob of data for deployment. The only things that a container can't handle are dependent services (e.g. external databases).

As a consequence, those containers are nearly everything one needs to deploy new software. If docker (or a container runtime) is installed, use the facilities provided to download the container and start it. Suddenly, your Java/NodeJS/Go/Python/Rust/C++/Wordpress machination is up-and-running without having to install anything.

Docker is the de-facto format and runtime for managing containers, though there are new contenders for container creation and management, like rkt. Considering rkt uses the open-container-initiative's specifications, whereas Docker has already started to diverge from that, it's a question about whether rkt may be a better choice going forward.

If you aren't using containers of some type yet, you are missing out.

UPDATE -- 2016/04/17 -- Docker v1.11 uses the OCI container specification.

Configuration Management - Ansible

There is a lot of discussion across the internet about Puppet, Chef, and SaltStack, but the real winner is Ansible. Ansible does everything that the docker container can't handle, like basic systems provisioning.

What makes Ansible the winner here? Puppet, Chef, and even Salt require bootstrapping of some kind -- some kind of Agent installed on minions that let them communicate with their Master server. Ansible has none of that. Working purely over python & ssh, Ansible's default is to connect to a box using SSH and perform actions without leaving a trace that it was there.

What about code changes? What about half-completed operations? Those are handled with Assertions in an ansible playbook. When things aren't the way your playbook expects them to be, Ansible will complain, LOUDLY, and refuse to do it. No automatic attempts at resolving the issue or starting things in the background (unless you programmed it that way), just a not-so-quiet failure and the reasons for said failure.

You might ask "what about audit trails and other security functions." Since Ansible runs over SSH, whoever has Ansible access has SSH access, meaning they are limited by what they could do if they were to remotely connect to the machine and run the commands themselves. Thus Ansible does not provide an additional route into a box, but automation over an existing access method. To avoid granting all and sundry SSH access, find ways to push data away from the box and feed it into things like centralized logging platforms, provide web access to diagnostic systems, and so on. And if you want an audit trail for the changes to playbooks, follow the best practices and build your infrastructure immutably.

Scripting and Development - Go (and others)

It really doesn't matter so much what language you choose, so long as you understand said language. If, however, you're in a "green field" situation or have the room to really dig in and adopt a new language, give Go a shot. It's very much a "blue-collar language" in the same vein as Java -- one can write themselves out of a problem, without needing to be "clever." This is the explicit design of the language:

The key point here is our programmers are Googlers, they're not researchers. They're typically, fairly young, fresh out of school, probably learned Java, maybe learned C or C++, probably learned Python. They're not capable of understanding a brilliant language but we want to use them to build good software. So, the language that we give them has to be easy for them to understand and easy to adopt. -- Rob Pike

And...

It must be familiar, roughly C-like. Programmers working at Google are early in their careers and are most familiar with procedural languages, particularly from the C family. The need to get programmers productive quickly in a new language means that the language cannot be too radical. -- Rob Pike

Bash, python, perl, scala, java, clojure, ruby, or any other language will do just fine. Go wins this contest by being built for productivity and by permitting statically compiled executables (and cross-compilation without a new toolchain) for great portability.

Cloud Provider - Google Compute Platform (Digital Ocean)

Cloud providers are expensive compared to buying one's own hardware for simple things, but they bring the benefits of adaptability and no need to negotiate colocation contracts. Based on the latest information I could gather at the time of this post (March 2016), Google Compute Platform provides approximately 20% savings over other providers for sufficiently large deployments. Depending on your use case, you can save more (or nothing at all).

Google Compute Platform is also undeniably Google, meaning you can be forced into doing things The Google Way, but there is great simplicity in its design and many provided services make for a compelling place to put your startup.

For small, personal, or green-field projects, Digital Ocean has lately been the best-of-breed among providers and integrates nicely with many provisioning systems (like Ansible).

Changes when working with cloud providers

One important change to note is that Engineers, when considering things like logging or caching information, must recognize that the "X is cheap" philosophies don't necessarily apply. As an example, yes, disk is cheap, but SSD-scale attached storage costs over four times the spindle-disk equivalent on Google Compute Platform. Moreover, both storage types have a maximum throughput, beyond which adding more storage does not grant more throughput or IOPS. So it is with all other resources in any given cloud environment.

Similarly, one must build applications with an eye toward recovery and adaptation. Network reliability is not 100%; the SLA is usually around 98-99%. By definition, this means that a network (or only part of it) could be unreachable for well over ten minutes of cumulative time in a day (1% of a day is about 14 minutes). To combat this, one must do things like permit retries when submitting information to an API and reconnect to databases when a connection drops. Always try to implement backoff code to avoid the "thundering herd" problem when those systems do reconnect.
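As a rough illustration of the retry-and-backoff advice, here is a minimal sketch in Python. The helper name `retry_with_backoff`, its defaults, and the choice of which exceptions count as transient are all assumptions for the example; tune them to whatever your client library actually raises.

```python
import random
import time


def retry_with_backoff(operation, attempts=5, base_delay=0.5, max_delay=30.0,
                       retry_on=(ConnectionError, TimeoutError), sleep=time.sleep):
    """Call operation(); on a transient error, wait and try again.

    The wait ceiling doubles with each attempt (capped at max_delay) and the
    actual wait is a random value below that ceiling, so many clients
    recovering from the same outage do not reconnect in lockstep.
    """
    for attempt in range(attempts):
        try:
            return operation()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the failure to the caller
            ceiling = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, ceiling))
```

The randomness ("jitter") is what addresses the thundering herd: without it, every client that failed at the same moment would also retry at the same moment. The `sleep` parameter exists only so the waiting can be swapped out in tests.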

Finally, if a system is built with all of these aspects in mind, use (free!) encryption and authentication for all endpoints. Single-factor authentication might be sufficient when communicating between servers, but for user-level access, adopt 2FA if you mean to do anything remotely sensitive. Yes, this includes something as simple as storing a user's address.

More to come

I sincerely hope I haven't put anyone off with this, and I am anxious to have more suggestions and feedback as I am exposed to more of the "soft bits" toward being in DevOps. To mangle a friend's phrases, "technology choices are a very junior concern. No one cares. Focus on the business objectives, because that's all that really matters."

Software has boundaries; respect them

Fri, 18 Dec 2015 01:00:00 +0000

Published 2015-12-18

This post should be classified as a rant, and it's been brewing for a while.

All software has boundaries. When a system is built, the boundaries are defined and the end user (often) cannot change them. This doesn't mean we can't have dynamic software that does a great many flexible things (especially with embedded scripting languages like Lua) but it does mean that there are invariants in that program.

What I cannot understand is why a great many engineers look at a system which they themselves didn't build and expect it to do exactly what they want. "It should work this way" is often heard and that statement is entirely unhelpful. Systems don't change just because you want them to. Computers are dumb and do exactly what they are told and those programs may not allow you any form of customization. That said, if you didn't write the code for a given system, then you need to either adapt yourself to the system or find something else.

End Users understand this. They know they are given a box-within-a-box, full of constraints, and they will work day and night to find ways to make that box sing. What makes it any different for Engineers?

None of this is to say that You, the engineer / software developer / DevOps person / IT guy, can't write your own version of the software sitting in front of you. Until you do, especially if you're dealing with a must-use system at work, take the software/platform/thing for what it is and learn it. The more you fight it, the harder it is for the rest of us, to no one's benefit.

Don't worry about NAT, just use firewalls

Mon, 14 Dec 2015 01:00:00 +0000

Published 2015-12-14

I hear again and again about how NAT is a "best practice" and should be used as the base of network design. When working in the Google Cloud (and Amazon EC2), publicly routeable IPv4 addresses are provided to each machine "for free" within Google's IPv4 allocation block. When this is first brought up, everyone reacts defensively and cautiously (good job, security teams -- you finally got that message out). Once it is fully explained that having an external IP address does not inherently reduce the security of a machine, people tend to see it as a benefit.

Stack Overflow threads actually addressed my points much better than I had myself, so here is one snippet and one full-length response regarding NAT...

While this page is addressing IPv6, the same information applies:

We can argue the merits of NAT, the end-to-end principle, and security until we're blue in the face -- and many have -- but the reality is that NAT does not provide any real network security. Worse yet, it actually prevents many security measures and provides an additional attack surface for your network.

And this ServerFault page regarding IPv6 and "NAT being a thing of the past" actually addresses a network with all IPv4 addresses publicly accessible:

First and foremost, there is nothing to fear from being on a public IP allocation, so long as your security devices are configured right.

What should I be replacing NAT with, if we don't have physically separate networks?

The same thing we've been physically separating them with since the 1980's, routers and firewalls. The one big security gain you get with NAT is that it forces you into a default-deny configuration. In order to get any service through it, you have to explicitly punch holes. The fancier devices even allow you to apply IP-based ACLs to those holes, just like a firewall. Probably because they have 'Firewall' on the box, actually. A correctly configured firewall provides exactly the same service as a NAT gateway. NAT gateways are frequently used because they're easier to get into a secure config than most firewalls.

I hear that IPv6 and IPSEC are supposed to make all this secure somehow, but without physically separated networks that make these devices invisible to the Internet, I really can't see how.

This is a misconception. I work for a University that has a /16 IPv4 allocation, and the vast, vast majority of our IP address consumption is on that public allocation. Certainly all of our end-user workstations and printers. Our RFC1918 consumption is limited to network devices and certain specific servers where such addresses are required. I would not be surprised if you just shivered just now, because I certainly did when I showed up on my first day and saw the post-it on my monitor with my IP address. And yet, we survive. Why? Because we have an exterior firewall configured for default-deny with limited ICMP throughput. Just because 140.160.123.45 is theoretically routeable, does not mean you can get there from wherever you are on the public internet. This is what firewalls were designed to do. Given the right router configs, and different subnets in our allocation can be completely unreachable from each other. You do can do this in router tables or firewalls. This is a separate network and has satisfied our security auditors in the past.

There's no way in hell I'll put our billing database (With lots of credit card information!) on the internet for everyone to see.

Our billing database is on a public IPv4 address, and has been for its entire existence, but we have proof you can't get there from here. Just because an address is on the public v4 routeable list does not mean it is guaranteed to be delivered. The two firewalls between the evils of the Internet and the actual database ports filter out the evil. Even from my desk, behind the first firewall, I can't get to that database. Credit-card information is one special case. That's subject to the PCI-DSS standards, and the standards state directly that servers that contain such data have to be behind a NAT gateway[1]. Ours are, and these three servers represent our total server usage of RFC1918 addresses. It doesn't add any security, just a layer of complexity, but we need to get that checkbox checked for audits.

The original "IPv6 makes NAT a thing of the past" idea was put forward before the Internet boom really hit full mainstream. In 1995 NAT was a workaround for getting around a small IP allocation. In 2005 it was enshrined in many Security Best Practices document, and at least one major standard (PCI-DSS to be specific). The only concrete benefit NAT gives is that an external entity performing recon on the network doesn't know what the IP landscape looks like behind the NAT device (though thanks to RFC1918 they have a good guess), and on NAT-free IPv4 (such as my work) that isn't the case. It's a small step in defense-in-depth, not a big one. The replacement for RFC1918 addresses are what are called Unique Local Addresses. Like RFC1918, they don't route unless peers specifically agree to let them route. Unlike RFC1918, they are (probably) globally unique. IPv6 address translators that translate a ULA to a Global IP do exist in the higher range perimeter gear, definitely not in the SOHO gear yet. You can survive just fine with a public IP address. Just keep in mind that 'public' does not guarantee 'reachable', and you'll be fine.

1: The PCI-DSS standards changed in October 2010, the statement mandating RFC1918 addresses was removed, and 'network isolation' replaced it.

Key Points here: "The two firewalls between the evils of the Internet and the actual database ports filter out the evil." Also "just keep in mind that 'public' does not guarantee 'reachable' and you'll be fine."

To reiterate the response, having a default-deny firewall (which the major cloud-provider firewalls are) and a cautious approach to opening ports or whitelisting traffic is sufficient to accomplish the same goals as NAT, with the only exception of "getting around a small network allocation."
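To show what "default-deny" means in practice, here is a small sketch of the decision a firewall makes for each connection. It is a toy model, not any cloud provider's rule format: `AllowRule`, `is_allowed`, the address ranges, and the ports in `RULES` are hypothetical.

```python
from dataclasses import dataclass
from ipaddress import ip_address, ip_network


@dataclass(frozen=True)
class AllowRule:
    source: str  # CIDR block allowed to connect, e.g. "0.0.0.0/0"
    port: int    # destination TCP port


def is_allowed(rules, source_ip: str, port: int) -> bool:
    """Default-deny: traffic passes only if some rule explicitly matches it."""
    addr = ip_address(source_ip)
    return any(port == rule.port and addr in ip_network(rule.source)
               for rule in rules)


# Hypothetical rule set: HTTPS from anywhere, SSH from one office range,
# and the database port only from the application subnet.
RULES = [
    AllowRule("0.0.0.0/0", 443),
    AllowRule("203.0.113.0/24", 22),
    AllowRule("10.0.1.0/24", 5432),
]
```

There is no "deny" rule anywhere in the list. A host with a public address and an empty rule set is unreachable, which is the same property people usually credit to NAT.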

If you have a small fixed number of IP addresses, NAT is your only easy way of getting more addresses than you have. Otherwise, it's unnecessary.

Reasons why VPN in the Google Cloud environment is superfluous

Thu, 10 Dec 2015 01:00:00 +0000

Published 2015-12-10

VPN's main functions are two-fold: encapsulation and encryption

VPN encapsulates the intended transmission from one network to another. Encryption is used to hide the data from any systems in-flight.

Delivering data from one network to another has nothing to do with addresses -- routers will push packets regardless. Encryption is usable at all levels, can exist in multiple layers, and can be strengthened or improved independently of the service itself (e.g. use ed25519 keys instead of RSA). VPN is an unnecessary addition to the security model because the requisite pieces are already provided via firewalls and encryption services like SSH and TLS.

VPN requires "turning it on" when accessing resources

VPN adds another "switch" to flip when performing "real work," especially when working from home or unfamiliar environments. Employees can, and will, forget to enable the VPN when working, or will have trouble establishing a VPN tunnel from a Starbucks when on-call. This can and does increase troubleshooting load, including maintenance and support load from IT, even when everything is functioning normally because the VPN adds cognitive overhead to an Engineer's debugging process. Additionally, during high-stress situations employees (including Management) may circumvent the VPN for the sake of expediency, which removes its protection.

The VPN provides a single point of failure -- accessing systems over the office-bound VPN failed repeatedly while we tried to resolve networking issues. For several days there was intermittent connectivity to internal addresses in GCP, greatly hampering developer productivity.

Using external IP address access for Engineering operations means connections are always the same, work successfully over the public internet, and can be verified, every time. No one has to remember to turn on an External Interface in order to access a service; they simply connect and are authenticated through. When paired with multifactor authentication and hardware tokens like the thoroughly-vetted-and-reviewed Yubikey, systems are often more secure than when accessed using an authenticated VPN connection.

VPN provides a false sense of security

VPNs are usually evangelized as a reason for everyone to "rest easy" knowing that they are operating within a perceived-controlled environment. This could not be further from the truth -- the networked environment is just as vulnerable as (if not more so than) one's own computer. Even worse, should any one employee's laptop be compromised, VPN access is under the attacker's control, and once through the VPN they have free rein among the unencrypted internal services.

In contrast, using multiple layers of authorization and encryption (RE: Duo Security and TLS/SSH) means that the connections are always secured and can be limited in their scope. Should one find that a given app or host is misbehaving, they can change authentication credentials and audit the offending part instead.

Specifying multiple layers of firewalls is more restrictive by default than VPN

VPNs tend to be set up such that each endpoint provides access to the rest of the network as a sort of "high ground." There are no real limits or tests of traffic from the VPN into the rest of the network. Some VPNs provide VLAN isolation, but these VPNs are meant to allow engineering work to happen, and so often provide access directly to the most desirable targets.

Using multiple layers of firewalls instead means that traffic is constantly checked, both on initial ingress and further attempts to contact / connect with other instances. This means that a person or system which isn't supposed to contact the database cannot contact the database despite otherwise having access.

Responses

Isn't using a VPN considered a best-practice?

It is, but only due to historical issues which have nothing to do with current technology. The reasons for using VPNs are tied to old-school business models (i.e. datacenter-to-datacenter and office-to-office connectivity), old software which does not support encryption, IPv4 limitations, and enshrined "Security Best Practices":

NAT helped reduce the issue of publicly routable IPv4 address scarcity/cost, which meant leveraging more private networks (RFC1918)
Offices would have a server-closet, or a datacenter location, storing all of their data. That system would live in one centralized location, prompting the question of how to access data from another branch office without using the public internet.
https://en.wikipedia.org/wiki/Timeline_of_file_sharing
Encryption was either expensive, slow, or expressly forbidden from leaving the country. This meant that being able to account for secured transmissions between two points within the country was extremely valuable. (This is a guarantee that public internet routes could not necessarily enforce.)
Multifactor authentication was seen as unnecessary for all but "Top Secret Government stuff" and when software companies provided things like "dongles" using their own authentication methods, they were circumvented or exploited, or usable on a single type of system (no interoperability).

None of these issues apply today:

IPv4 addresses are still scarce, but Google Compute Platform and other cloud providers will provide public IPs to hosts for free. One is not charged for their existence or use (unless an address is specifically reserved and goes unused).
Modern systems use a variety of strong encryption methods, like TLS and SSH, to secure transfers.
https://en.wikipedia.org/wiki/AES_instruction_set
TLS provides strong encryption and can work over any TCP connection, cheaply. Network transfer is also equally cheap.
Multifactor authentication is easily added to most systems and is implemented using publicly tested implementations, interoperable with any platform.

For all of these reasons, we don't need VPN if we can encrypt and authenticate our services. None of the technology is stopping us.
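To give a sense of how little code "encrypt and authenticate" takes on the client side, here is a sketch using Python's standard `ssl` module. The helper names `strict_client_context` and `open_tls_connection` are mine, and requiring TLS 1.2 as the floor is an assumption for the example, not a recommendation from any standard.

```python
import socket
import ssl


def strict_client_context() -> ssl.SSLContext:
    """Client-side TLS settings: verify the server certificate and hostname."""
    context = ssl.create_default_context()            # uses the system CA store
    context.minimum_version = ssl.TLSVersion.TLSv1_2  # refuse older protocol versions
    return context


def open_tls_connection(host: str, port: int = 443,
                        timeout: float = 5.0) -> ssl.SSLSocket:
    """Wrap a plain TCP connection in TLS.

    Raises ssl.SSLError (for example on a certificate that does not verify)
    instead of silently falling back to cleartext.
    """
    raw = socket.create_connection((host, port), timeout=timeout)
    try:
        return strict_client_context().wrap_socket(raw, server_hostname=host)
    except Exception:
        raw.close()
        raise
```

The same pattern works over any TCP connection, public or private, which is why the network path matters much less than whether each service verifies who it is talking to.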

Google provides VPN services, so why would they provide it if it isn't needed?

Google works with many companies which have security requirements that mandate the use of a VPN. (See reasons above.) If they did not provide "feature parity" with those legacy systems, they wouldn't be able to work with the larger companies that give them lots of money in hosting costs. In short, it costs them almost nothing to offer it, and they give their clients the ability to check a box in their requirements documentation for security audits. Details here.

Google uses things like the Yubikey and authenticating-gateways to secure their networks since they transitioned to a zero-trust design after the detection of an APT in 2009. If one of the largest and most-exposed companies is using a zero-trust model, one would assume it succeeds where a perimeter-defense design failed.

What about risk of exposure? What about the additional "attack surface"?

Port 22 (SSH) is open to the world on all GCP instances, as SSH is their only means of access. Botnets are constantly trying to access those systems with common accounts (e.g. admin, cisco, etc) and yet none of them get in. They are refused precisely because password-based logins are not used (unless someone explicitly does so). In the past, using SSH keys was believed considerably more effort than it was worth -- "just use a password" -- until enough systems were attacked due to weak passwords. During a recent port scan of an in-use production network, the only ports publicly accessible to an outside IP address were the following:

port 22 -- for SSH.
port 80, 443 -- for 14 specifically identified webservers.

Otherwise, the firewall denied anything not explicitly approved. The added fact that the only hosts which responded to cleartext requests were explicitly and specifically designed to do so should say something about the security of the setup.
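A port scan like this is only useful if its results are compared against what was approved. Below is a minimal sketch of that comparison; the function name, the `"*"` wildcard convention, and the host names in the usage note are invented for illustration, and the scan results are assumed to come from whatever scanner you already use.

```python
def unexpected_open_ports(scan_results, allowed):
    """Return {host: sorted ports} for every open port not explicitly approved.

    scan_results: mapping of host -> iterable of open TCP ports.
    allowed:      mapping of host -> set of approved ports; the key "*"
                  lists ports approved on every host.
    """
    findings = {}
    for host, open_ports in scan_results.items():
        approved = set(allowed.get("*", ())) | set(allowed.get(host, ()))
        extra = sorted(set(open_ports) - approved)
        if extra:
            findings[host] = extra
    return findings
```

With an allowlist such as `{"*": {22}, "web-1": {80, 443}}`, an empty result means the outside view of the network matches the firewall's intent; anything else is a rule that needs explaining.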