Why we use our own hardware

The original answer to "why does FastMail use their own hardware" is that when I started the company in 1999 there weren't many options. I actually originally used a single bare metal server at Rackspace, which at that time was a small scrappy startup. IIRC it cost $70/month. There weren't really practical VPS or SaaS alternatives back then for what I needed.

Rob (the author of the linked article) joined a few months later, and when we got too big for our Rackspace server, we looked at the cost of buying something and doing colo instead. The biggest challenge was trying to convince a vendor to let me use my Australian credit card but ship the server to a US address (we decided to use NYI for colo, based in NY). It turned out that IBM were able to do that, so they got our business. Both IBM and NYI were great for handling remote hands and hardware issues, which obviously we couldn't do from Australia.

A little bit later Bron joined us, and he automated absolutely everything, so that we were able to just have NYI plug in a new machine and it would set itself up from scratch. This all just used regular Linux capabilities and simple open source tools, plus of course a whole lot of Perl.

As the fortunes of AWS et al rose and rose and rose, I kept looking at their pricing and features and kept wondering what I was missing. They seemed orders of magnitude more expensive for something that was more complex to manage and would have locked us into a specific vendor's tooling. But everyone seemed to be flocking to them.

To this day I still use bare metal servers for pretty much everything, and still love having the ability to use simple universally-applicable tools like plain Linux, Bash, Perl, Python, and SSH, to handle everything cheaply and reliably.

I've been doing some planning over the last couple of years on teaching a course on how to do all this, although I was worried that folks are too locked in to SaaS stuff -- but perhaps things are changing and there might be interest in that after all?...

>As the fortunes of AWS et al rose and rose and rose, I kept looking at their pricing and features and kept wondering what I was missing. They seemed orders of magnitude more expensive for something that was more complex to manage and would have locked us into a specific vendor's tooling. But everyone seemed to be flocking to them.

In 2006, when the first AWS instances showed up, it would take you two years of on-demand bills to match the cost of buying the hardware from a retail store and using it continuously.

Today it's anywhere from two weeks for ML workloads to three months for mid-sized instances.
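
The break-even arithmetic here is simple enough to sketch in a few lines of Python. All prices below are illustrative assumptions, not quotes:

```python
# Months of 24/7 on-demand billing needed to match a one-time hardware
# purchase. Every price here is a made-up round number for illustration.

def breakeven_months(hardware_cost: float, hourly_rate: float) -> float:
    """Months of continuous on-demand usage that equal the purchase price."""
    hours_per_month = 730  # average hours in a month
    return hardware_cost / (hourly_rate * hours_per_month)

# 2006-style example: ~$2,000 retail server vs a ~$0.10/hr instance
print(round(breakeven_months(2000, 0.10), 1))   # roughly two years

# GPU example: an assumed $8,000 workstation vs an assumed $12/hr instance
print(round(breakeven_months(8000, 12.00), 2))  # well under a month
```

The exact crossover obviously moves with real prices, but the shape of the comparison (hours of rental vs one purchase) is the whole argument.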

AWS made sense in big corporations where it would take you six months to get approval for buying the hardware and another six for the software. Today I'd only use it for a prototype that I'd move on-prem the second it looks like it will make it past one quarter.

AWS is useful if you have uneven loads. Why pay for the number of servers you need for Christmas the rest of the year? But if your load is more even, it doesn't make as much sense.

The business case I give is a website which has a predictable spike in traffic which tails off.

In the UK we have a huge charity fundraising event called Red Nose Day and the public can donate online (or telephone if they want to speak to a volunteer).

The website probably sees 90% of their traffic on the day itself - millions of users - and the remaining 10% tailing off a few days later. Then nothing.

The elasticity of the cloud allows the charity to massively scale their compute power for ONE day, then reduce it for a few days, and drop back down to a skeleton infrastructure until the next event - in a few years time.

(FWIW I have no clue if Red Nose Day ever uses the cloud but it's a great example of a business case requiring temporary high capacity compute to minimise costs)
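
The arithmetic behind that business case can be sketched with a toy model. The server counts and per-server daily price below are made up for illustration:

```python
# Toy cost model for a spiky workload: provision for peak all year
# vs pay only for what each day uses. All numbers are invented.

def fixed_cost(peak_servers: int, cost_per_server_day: float, days: int = 365) -> float:
    """Own/rent enough for the peak day, every day of the year."""
    return peak_servers * cost_per_server_day * days

def elastic_cost(daily_servers: list[int], cost_per_server_day: float) -> float:
    """Pay per server-day actually used."""
    return sum(daily_servers) * cost_per_server_day

# 1 spike day at 200 servers, a 5-day tail at 40, then a 2-server skeleton
demand = [200] + [40] * 5 + [2] * 359

print(fixed_cost(200, 10.0))       # peak provisioning all year: 730000.0
print(elastic_cost(demand, 10.0))  # elastic provisioning: 11180.0
```

With a spike this sharp, elastic provisioning is nearly two orders of magnitude cheaper, which is exactly the uneven-load case where the cloud premium pays for itself.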

But how does it look from AWS's point of view?

Everyone scales up around Christmas then scales down afterwards. What do THEY do with all the unneeded CPU-seconds for the rest of the year?

Only consumer-facing businesses scale up for the holidays. Most other industries scale down. The more companies they host, the more even the overall demand is for them.

Also, every unused resource goes into the spot market. They just have a bigger spot market during the year.

And lastly, that's why they charge a premium. Because they amortize the cost of spare hardware across all their customers.

We certainly don't scale up around Christmas. Apart from online shops and shipping companies, why would everyone else scale up around Christmas?

Not everyone. Agriculture is in a low season around then and scales way back. I don't know what other industries are like.

This.

Plus bidding on spot-instances used to be far less gamed so if you had infrequent batch jobs (just an extreme version of low-duty-cycle loading), there was nothing cheaper and easier.

I've been out of that "game" for a bit, but Google Compute used to have the cheapest bulk-compute instance pricing if all you needed was a big burst of CPU.

It's all changed if you're running ML workloads though.

AWS was built on hordes of VC backed startups drowning in heaps of cash and very little operational expertise.

"buying the hardware from a retail store." Never buy wholesale and never develop on immature hardware, I have seen c** with multiple 9 y.o. dev servers. I could shorten the ROI to less than 6 months.

What is c*? (seriously, I am not a native speaker and cannot turn the stars into a word that makes sense)

No worries, I AM a native speaker and I can't figure out the stars OR the specific parsing of that comment

> As the fortunes of AWS et al rose and rose and rose, I kept looking at their pricing at features and kept wondering what I was missing.

You are not the only one. There are several factors at play, but I believe one of the strongest today is the generational divide: people have lost the ability to manage their own infra, or don't know it well enough to do it well, so it's true when they say "it's too much hassle". I say this as an AWS guy who occasionally works on on-prem infra.[0]

[0] As a side note, I don't believe the lack of skills is the main reason organizations have problems - skills can be learned, but if you mess up the initial architecture design, fixing that can easily take years.

> I don't believe the lack of skills is the main reason organizations have problems

IDK. More and more I see the argument of “I don’t know, and we are not experts in xxx” as a winning argument of why we should just spend money on 3rd party services and products.

I have seen people getting paid 700k plus a year spend their entire stay at companies writing papers about how they can’t do something and the obvious solution is to spend 400k plus to have some 3rd party handle it, and getting the budget.

Let’s not get into what the conversation looks like when somebody points out that we might have an issue if we are paying somebody 700k to hire somebody else temporarily for 400k each year, and that we should find these folks who can do it for 400k and just hire Them.

All this to say that being a SWE in many companies today requires no ability to create software that solves business problems. But rather some sort of quasi system administrator manager who will maybe write a handful of DSL scripts over the course of their career.

It’s also human capital/resource allocation. We thought about spinning up our own servers at my last gig; we had the talent in house but that talent was busy building the product, not managing servers. I suppose it depends on what your need is as well.

>As the fortunes of AWS et al rose and rose and rose, I kept looking at their pricing at features and kept wondering what I was missing. They seemed orders of magnitude more expensive [...] To this day I still use bare metal servers for pretty much everything, [...] plain Linux, Bash, Perl, Python, and SSH, to handle everything cheaply

Your FastMail use case of (relatively) predictable server workload and product roadmap combined with agile Linux admins who are motivated to use close-to-bare-metal tools isn't an optimal cost fit for AWS. You're not missing anything and FastMail would have been overpaying for cloud.

Where AWS/GCP/Azure shine is organizations that need higher-level PaaS like managed DynamoDB, Redshift, SQS, etc. that run on top of bare metal. Most non-tech companies with internal IT departments cannot create/operate "internal cloud services" that are on par with AWS.[1] Some companies like Facebook and Walmart can run internal IT departments with advanced capabilities like AWS, but most non-tech companies can't. This means paying AWS's fat profit margins can actually be cheaper than paying internal IT salaries to "reinvent AWS badly" by installing MySQL, Kafka, etc. on bare-metal Linux. E.g. Netflix had their own datacenters in 2008, but a 3-day database outage that stopped them from shipping DVDs was one of the reasons they quit running their own datacenters and migrated to AWS.[2] Their complex workload isn't a good fit for bare-metal Linux and bash scripts; Netflix uses a ton of high-level managed PaaS services from AWS.

If bare metal is the layer of abstraction the IT & dev departments are comfortable working at, then self-host on-premise, or co-lo, or Hetzner are all cheaper than AWS.

[1] https://web.archive.org/web/20160319022029/https://www.compu...

[2] https://media.netflix.com/en/company-blog/completing-the-net...

> although I was worried that folks are too locked in to SaaS stuff

For some people the cloud is straight magic, but for many of us, it just represents work we don't have to do. Let "the cloud" manage the hardware and you can deliver a SaaS product with all the nines you could ask for...

> teaching a course on how to do all this ... there might be interest in that after all?

Idk about a course, but I'd be interested in a blog post or something that addresses the pain points that I conveniently outsource to AWS. We have to maintain SOC 2 compliance, and there's a good chunk of stuff in those compliance requirements around physical security and datacenter hygiene that I get to just point at AWS for.

I've run physical servers for production resources in the past, but they weren't exactly locked up in Fort Knox.

I would find some in-depth details on these aspects interesting, but from a less-clinical viewpoint than the ones presented in the cloud vendors' SOC reports.

I’ve never visited a datacenter that wasn’t SOC 2 compliant. Bahnhof, SAVVIS, Telecity, Equinix, etc.

Of course, their SOC 2 compliance doesn't mean we are absolved of securing our databases and services.

There's a big gap between throwing some compute in a closet and having someone “run the closet” for you.

There is, but there's a significantly larger gap between having someone “run the closet” and building your own datacenter from scratch.

A datacenter being SOC 2 compliant doesn’t mean any of your systems are. Same with PCI. Same with HIPAA. Cloud providers usually have offerings that help meet those requirements as well, but again, whether you host bare metal, colo, cloud, or a tower under your bed, their compliance doesn’t do anything to cover your compliance.

Yes, quite right, that’s what I meant with my “I still have to do the work of securing my services”.

Would be the same no matter where I’m hosted.

Going to guess you meant to reply to the parent though?

You're describing stuff the colo provider does. I have no plans to describe how to setup a colo provider. I've never done that, and haven't seen the need. The cost of colo is not that significant.

In my 25 years, I've run some really big on-prem workloads and some of the biggest cloud loads (Sendmail.org and its mail servers, and Netflix streaming). Here is why I like the cloud:

Flexibility.

When Netflix wanted to start operating in Europe, we didn't have to negotiate datacenter space, order a bunch of servers, wait for racking and stacking, and all those other things. We just made an API call and had an entire stack built in Europe.

Same thing when we expanded to Asia.

It also saved us a ton of money, because our workload was about 3x peak to trough each day. We would scale up for peak, and scale down for trough.
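
That 3x peak-to-trough figure translates into concrete savings under autoscaling. A rough sketch, assuming an idealized sinusoidal daily load curve (real traffic curves differ):

```python
import math

# Fraction of capacity-hours saved by autoscaling a smooth daily cycle,
# relative to provisioning for peak all day. The sinusoidal shape is an
# idealization; only the peak-to-trough ratio comes from the comment.

def autoscale_savings(peak_to_trough: float, samples: int = 24) -> float:
    trough, peak = 1.0, peak_to_trough
    # hourly demand oscillating between trough and peak over 24 hours
    hours = [trough + (peak - trough) * (1 + math.sin(2 * math.pi * h / 24)) / 2
             for h in range(samples)]
    return 1 - sum(hours) / len(hours) / peak

print(round(autoscale_savings(3.0), 2))  # about a third of capacity-hours saved
```

For a symmetric cycle the average sits halfway between trough and peak, so a 3x ratio means paying for roughly 2/3 of peak capacity instead of all of it.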

We used on-prem for the parts where that made sense -- serving the actual video bits. Those were done on custom servers with a very stripped down FreeBSD optimized just for serving video (so optimized that we still used Akamai for images). But the part of the business that needed flexibility (control plane and interface) were all in AWS.

Why would a startup use the cloud? Both flexibility and ease. There aren't a lot of experts around who can configure a Linux box from scratch anymore. And even if you can, you can't go from coded-up idea to production in five minutes like you can with the cloud. It would take you at least a few hours to set up the bare metal the first time.

When you say “cloud”, are you including old school web hosts that will rent you a dedicated server?

Like OVH, Hetzner or Hivelocity?

Because you can get some insane servers for like $300/month (eg brand new 5th gen Epyc 48-core / 0.5TB ram / lots of NVME) and globally available.

Those could count. But you'll still end up having to do some Linux admin, which a lot of people can't do anymore.

The whole point is that the closer you can get to "write code, run code", the faster you can launch and innovate.

Linux admins still exist. Except that they are better paid than ever at cloud providers. What you're describing is more payroll flexibility than technical flexibility.

How is it not technical flexibility? No matter what talent you have on payroll, you can't spin up a whole datacenter's worth of machines in Europe in less than a day without a cloud provider.

And I mean less than a day from "I think we should operate in Europe" to "we are operating production workloads in Europe".

It sounds like you’re describing PaaS then.

I used to help manage a couple of racks' worth of on-premises hardware in the early-to-mid 2000s.

We had some old Compaq (?) servers, most of the newer stuff was Dell. Mix of windows and Linux servers.

Even with the Dell boxes, things weren't really standard across different server generations, and every upgrade was bespoke, except in cases where we bought multiple boxes for redundancy/scaling of a particular service.

What I'd like to see is something like Oxide Computer servers that scale way down, at least to a quarter rack. Like some kind of Supermicro meets Backblaze storage pod, but riffing on Joyent's idea of colocating storage and compute. A sort of composable mainframe for small businesses in the 2020s.

I guess maybe that is part of what Triton is all about.

But anyway - somewhere to start, and grow into the future with sensible redundancies and open source bios/firmware/etc.

Not the typical situation today, where you buy two "big enough" boxes (for redundancy) and then need to reinvent your setup/deployment when you need two bigger boxes in three years.

Yeah, having something like Oxide but smaller would be awesome.

AWS is only expensive if you intend to run a lot of workloads and have a large, competent technical team.

For businesses with <10 servers and half an IT person, the cost difference is practically irrelevant. EC2+EBS+snapshots is a magic bullet abstraction for most scenarios. Bare metal is nice until parts of it start to fail on you.

I can teach someone from accounting how to restore the entire VM farm in an afternoon using the AWS web console. I've never seen an on prem setup where a similar feat is possible. There's always some weird arcane exceptions due to economic compromises that Amazon was not forced to make. When you can afford to build a fleet of data centers, you can provide a degree of standardization in product offering that is extraordinarily hard to beat. If your main goal is to chase customers and build products for them, this kind of stuff goes a long way.

Long term you should always seek total autonomy over your information technology, but you should be careful to not let that goal ruin the principal business that underlies everything.

I'm confused why you would even need AWS then (what's running on the VMs)?

My impression is the standard compute (as in CPUs+RAM) isn't expensive, it's the storage (1 PB is less than half a rack physically now, comparing with the yearly prices listed), and so if you don't have much data, the value of on-prem isn't there.

For smaller shops I'd argue storage is the hardest part. I've done several OpenStack and baremetal K8s deployments on prem and the part that always stressed me out the most was storage. I'd happily pay a markup for that vs just about anything else that would be more economical to do on prem for smaller simpler workloads.

Also, encrypted storage on AWS is so simple. Encrypted root file systems on-prem are not easy.

This is it for me too. EBS is a bigger deal than the EC2 instances themselves.

As someone who lived through that era, I can tell you there are legions of devs and dev-adjacent people who have no idea what it’s like to automate mission-critical hardware. Everyone had to do it in the early 2000s. But it’s been long enough that there are people in the workforce who just have no idea about running your own hardware, since they never had to. I suspect there is a lot of interest, especially since we’re likely approaching the bring-it-back-in-house cycle, as CTOs try to rein in their cloud spend.

>But everyone seemed to be flocking to them.

To the point that we have young devs today who don't know what VPS and colo (colocation) mean.

Back to the article: I am surprised it was only "a few years ago" that Fastmail adopted SSDs, which certainly seems late in the cycle for the benefits SSDs offer.

Price for colo is on the order of $3,000/2U/year. That is $125/U/month.

We adopted SSD for the current week's email and rust for the deeper storage many years ago. A few years ago we switched to everything on NVMe, so there's no longer two tiers of storage. That's when the pricing switched to make it worthwhile.

> Which certainly seems late in the cycle for the benefits of what SSD offers.

90% of emails are never read, 9% are read once. What could SSD offer for this use case except at least 2x the cost?
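
As a back-of-envelope check on that claim (the drive prices and the heavy-reader re-read count below are assumptions, not measurements):

```python
# If 90% of messages are never read and 9% are read once, reads per
# message are rare, so capacity price (not access speed) dominates.
# Prices per GB and the re-read count are rough assumptions.

hdd_per_gb, ssd_per_gb = 0.015, 0.060   # assumed $/GB drive prices
p_never, p_once, p_heavy = 0.90, 0.09, 0.01
reads_if_heavy = 10                     # assume heavy readers re-read ~10x

expected_reads = p_never * 0 + p_once * 1 + p_heavy * reads_if_heavy
ssd_premium = ssd_per_gb / hdd_per_gb

print(round(expected_reads, 2))   # well under one read per message
print(round(ssd_premium, 1))      # SSD capacity premium at these prices
```

With a fraction of a read per message on average, paying a multiple per gigabyte for faster reads buys very little.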

Don't forget that Fastmail is accessed over an internet transport with enough latency to make HDD seek times noise.

Colo is typically sold on power, not space. From your example, you're either getting ripped off if it's for low-power servers or massively undercharged for a 4x A100 machine.

HDDs are still the best option for many workloads, including email.

What??

I can get an entire rack at Equinix for ~1200/mo with an unlimited 10g internet connect.

Please do this course. It's still needed and a lot of people would benefit from it. It's just that the loudest voices are all in on Cloud that it seems otherwise.

> I've been doing some planning over the last couple of years on teaching a course on how to do all this

Yes! It's surprisingly common to hear it can't work, or can't scale or run reliably, when all that is done. Talking about how you've done it is great from that perspective.

Also, it's worth talking about what you gain, qualitatively! As this post mentions, your high-performance storage options are far better outside the cloud. People often mention egress, too. The appealing idea to me is using your extra flexibility to deploy better stuff, not saving a bit of cost.

You know how to set up a rock-solid remote-hands console to all your servers, I take it? A dial-up modem to a serial console server, serial cables to all the servers (or IPMI on a segregated network and management ports). Then you deal with varying hardware implementations and OSes, setting that up in all your racks in all your colos.

Compare that to AWS, where there are 6 different kinds of remote hands, that work on all hardware and OSes, with no need for expertise, no time taken. No planning, no purchases, no shipment time, no waiting for remote hands to set it up, no diagnosing failures, etc, etc, etc...

That's just one thing. There's a thousand more things, just for a plain old VM. And the cloud provides way more than VMs.

The number of failures you can have on-prem is insane. Hardware can fail for all kinds of reasons (you must know this), and you have to have hot backup/spares, because otherwise you'll find out your spares don't work. Getting new gear in can take weeks (it "shouldn't" take that long, but there's little things like pandemics and global shortages on chips and disks that you can't predict). Power and cooling can go out. There's so many things that can (and eventually will) go wrong.

Why expose your business to that much risk, and have to build that much expertise? To save a few bucks on a server?

It's really not like that at all. If it was, I expect after 25 years of growth FastMail would probably have noticed. Much of what you're describing assumes a poorly run company that isn't able to make good choices -- if you have such a mix of odd hardware or OSes then that's a pretty bad sign.

Prioritise simplicity.

For remote hands, two kinds are sufficient: IP KVM, and an actual person walking over to your machine. Can't say I've had an AWS person talk to me on a cell phone whilst standing at my server to help me sort out an issue.

It's actually really fun, and saving 90% of what can be your largest cost can actually be a fundamental driver of startup success. You can undercut the competition on price and offer stuff that's just not available otherwise.

Every time this conversation has come up online over the last few decades there's always a few people who parrot this claim it's all too hard. I can't imagine these comments come from people that have actually gone and done it.

> Every time this conversation has come up online over the last few decades there's always a few people who parrot this claim it's all too hard. I can't imagine these comments come from people that have actually gone and done it.

My experience of this is that people either fall into the camp of having done it under a set of non-ideal constraints (leading them to do it badly), or it's post-rationalising that they just don't want to.

> Hardware can fail for all kinds of reasons

Complex cloud infra can also fail for all kinds of reasons, and they are often harder to troubleshoot than a hardware failure. My experience with server grade hardware in a reliable colo with a good uplink is it's generally an extremely reliable combination.

And my experience is the opposite, on both counts. I guess it's moot because two anecdotes cancel each other out?

Cloud VMs fail from either the instance itself not coming back online, or an EBS failure, or some other AZ-wide or region-wide failure that affects networking or control plane. It's very rare, but I have seen it happen - twice, across more than a thousand AWS accounts in 10 years. But even when it does happen, you can just spin up a new instance, restoring from a snapshot or backup. It's ridiculously easier to recover than dealing with an on-prem hardware failure, and actually reliable, as there's always capacity (I guess barring GPU-heavy instances).

"Server grade hardware in a reliable colo with good uplink" literally failed on my company last week, went hard down, couldn't get it back up. Not only that server but the backup server too. 3 day outage for one of the company's biggest products. But I'm sure you'll claim my real world issue is somehow invalid. If we had just been "more perfect", used "better hardware", "a better colo", or had "better people", nothing bad would have happened.

There is a lot of statistical and empirical data on this topic - MTBF estimates from vendors (typically 100k to 1M+ hours), Backblaze and Google drive-failure data (~1-2% annual failure rate), IEEE, and others. With N+1 redundancy (backup servers/RAID plus spare drives) and proper design and change-control processes, operational failures should be very rare.
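
Those failure rates can be turned into a rough outage probability. A hedged sketch, assuming independent failures and a fixed repair window (both simplifications):

```python
# Rough probability that a primary AND its spare both fail before the
# primary is repaired, given an annual failure rate (AFR) like the
# ~1-2% figures cited. Assumes independence and a constant hazard rate.

def simultaneous_failure_prob(annual_failure_rate: float, repair_days: float) -> float:
    """P(primary fails in a year) * P(spare fails during the repair window)."""
    window_rate = annual_failure_rate * repair_days / 365
    return annual_failure_rate * window_rate

p = simultaneous_failure_prob(0.015, repair_days=3)
print(f"{p:.2e}")  # on the order of one in a million per pair per year
```

Correlated failures (bad firmware batch, shared power, shared backplane) break the independence assumption, which is why the same model argues for diversity in spares, not just quantity.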

With cloud, hardware issues are just the start: yes, you MUST "plan for failure", leveraging load balancers, auto scaling, CloudWatch, and dozens of other proprietary dials and knobs. However, you must also consider control plane, quotas, capacity, IAM, spend, and other non-hardware breaking points.

Your autoscaling isn't working - is the AZ out of capacity, did you hit a quota limit, run out of IPv4s, or was an AMI inadvertently removed? Your instance is unable to write to S3 - is the metadata service being flaky (for your IAM role), or is it due to an IAM role / S3 policy change? Your Lambda function is failing - did it hit a timeout, or exhaust the (512MB) temp storage? Need help diagnosing an issue - what is your paid support tier? Submit a ticket and we'll get back to you sometime in the next 24 hours.

> The number of failures you can have on-prem is insane. Hardware can fail for all kinds of reasons (you must know this)

Cloud vendors are not immune from hardware failure. What do you think their underlying infrastructure runs on, some magical contraption made from Lego bricks, Swiss chocolate, and positive vibes?

It's the same hardware, prone to the same failures. You've just outsourced worrying about it.

The hardware is prone to the same failures, but the customers rarely experience them, because they handle it for you. EBS means never worrying about disks. S3 means never worrying about objects. EC2 ASG means never worrying about failed machines/VMs. Multi-AZ means never worrying about an entire datacenter going down.

Yes, you pay someone else to worry about it. That's kinda the whole idea.

ok...?

But it comes at a cost. And that cost is significant. Like, orders-of-magnitude significant.

At what point does it become cheaper to hire an infra engineer? Let's see.

In the US a good infra engineer might cost you $150K/yr all in. That's not taking into account freelancers/contractors who can do it for less.

That's ~$12K/mo.

That's a lot of compute on AWS...but that's not the end of the story. Ever try getting data OUT of AWS? Yeah, those egress costs are not chump change. But that's not even the end of it.

The more important question is, what's the ratio of hosting/cloud costs to overall revenue? If colo/owned DC will yield better financials over ~few quarters, you'd be bananas as a CTO to recommend the cloud.

The bigger cost is what will happen to your business when you're hard-down for a week because all your SQL servers are down, and you don't have spares, and it will take a week to ship new servers and get them racked. Even if you think you could do that very fast, there is no guarantee. I've seen Murphy's Law laugh in the face of assumptions and expectations too many times.

But let's not just make vague claims. Everybody keeps saying AWS is more expensive, right? So let's look at one random example: the cost of a server in AWS vs buying your own server in a colo.

  AWS:
    1x c6g.8xlarge (32-vCPU, 64GB RAM, us-east-2, Reserved Instance plan @ 3yrs)
       Cost up front: $5,719
       Cost over 3 years: $11,437 ($158.85/month + $5,719 upfront)

  On-prem:
    1x Supermicro 1U WIO A+ Server (AS-1115SV-WTNRT), 1x AMD EPYC™ 8324P Processor 32-Core 2.65GHz 128MB Cache (180W), 2x 32GB DDR5 5600MHz ECC RDIMM Server Memory, 2x 240GB 2.5" PM893 SATA 6Gb/s Solid State Drive (1 x DWPD), 3 Years Parts and Labor + 2 Years of Cross Shipment, MCP-290-00063-0N - Supermicro 1U Rail Kit (Included), 2 10GbE RJ45 Ports : $4,953.40
    1x Colo shared rack 1U 2-PS @ 120VAC: $120/month (100Mbps only)
      Cost up front: $4,953.40 (before shipping & tax)
      Cost over 3 years: $9,273 (minimum)

So, yes, the AWS server is double the cost (not an order of magnitude) of the Supermicro (and this varies depending on configuration). But with colocation fees, remote-hands fees, faster internet speeds, taxes, shipping, and all the rest of the nickel-and-diming, the cost of a single server in a colo is almost the same as AWS. Switch to a full rack, buy the networking gear, remote-hands gear, APCs, etc. that you'll probably want, and it's way, way more expensive to colo. In this one example.
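
The 3-year totals above reduce to a line of arithmetic each. A sketch that reproduces them (figures taken from the comparison above; real quotes will differ):

```python
# Reproduce the 3-year totals from the AWS-vs-colo comparison.
# All figures come from the comment, not from a current price list.

def three_year_total(upfront: float, monthly: float, months: int = 36) -> float:
    return upfront + monthly * months

aws = three_year_total(5719.00, 158.85)   # reserved-instance upfront + monthly
colo = three_year_total(4953.40, 120.00)  # server purchase + colo fee

print(round(aws))          # 11438 (the comment's $11,437 drops the 60 cents)
print(round(colo))         # 9273
print(round(aws / colo, 2))
```

Running the same function over different quotes (full rack, egress, remote hands) is a quick way to do the TCO analysis the comment recommends.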

Obviously, it all depends on a huge number of factors. Which is why it's better not to just take the copious number of "we do on-prem and everything is easy and cheap" stories at face value. Instead one should do a TCO analysis based on business risk, computing requirements, and the non-monetary costs of running your own micro-datacenter.

> The bigger cost is what will happen to your business when you're hard-down for a week because all your SQL servers are down, and you don't have spares, and it will take a week to ship new servers and get them racked. Even if you think you could do that very fast, there is no guarantee. I've seen Murphy's Law laugh in the face of assumptions and expectations too many times.

Let's ignore the loaded, cherry-picked situation of no redundancy, no spares, and no warranty service. Because this all magically became hard once cloud providers appeared, even though many of us did this, and have done this, for years....

There is nothing stopping an on-prem user from renting a replacement from a cloud provider while waiting for hardware to show up. That's a good logical use case for the cloud we can all agree upon.

Next, your cost comparison isn't very accurate. One is isolated dedicated hardware, the other is shared. Junk fees such as egress, IPs, charges for access to metal instances, IOPS provisioning for a database, etc. will infest the AWS side. The performance of SAN vs local SSD is night and day for a database.

Finally, I can acquire that level of performance hardware much cheaper if I wanted to, order of magnitude is plausible and depends more on where it's located, colo costs, etc.

These servers are kinda tiny, and ignore the cost of storage. From the article, $252,000/y for 1 PB is crazy, and that's just storing it. There's also the CapEx vs OpEx aspect.

Yeah, if you don't have levels of redundancy, then you're pretty screwed. We could theoretically lose 2/3 of our systems and have sufficient capacity, because our metric is 2N primary plus N secondary, and we can run with half the racks switched off in the primary, or with the secondary entirely switched off, or (in theory, there's still some kinks with failover) with just secondary.

This. All of this and more. I've got friends who worked for hosting providers who over the years have echoed this comment. It's endless.

> As the fortunes of AWS et al rose and rose and rose, I kept looking at their pricing at features and kept wondering what I was missing.

How do the availability/fault tolerance compare? If one of your geographical locations gets knocked out (fire, flood, network cutoff, war, whatever) what will the user experience look like, vs. what can cloud providers provide?

" teaching a course on how to do all this..." Can you provide some notice of this so I can schedule my vacation time to fully participate? Let me know when registration is open.

What is the software side of things like? Is your team managing these servers directly — or is it "cloud like" with containers (Kubernetes?), IaC tools, etc.

As a customer of Fastmail and a fan of your work at FastAI and FastHTML I feel a bit stupid now for not knowing you started Fastmail.

Now I'm wondering how much you'd look like tiangolo if you wore a moustache.

Now I wonder what he'd look like without the moustache :)

Jeremy is all the Fast things!

The whole push to the cloud has always fascinated me. I get it - most people aren't interested in babysitting their own hardware. On the other hand, a business of just about any size that has any reasonable amount of hosting is better off with their own systems when it comes purely to cost.

All the pro-cloud talking points are just that - talking points that don't persuade anyone with any real technical understanding, but serve to introduce doubt to non-technical people and to trick people who don't examine what they're told.

What's particularly fascinating to me, though, is how some people are so pro-cloud that they'd argue with a writeup like this with silly cloud talking points. They don't seem to care much about data or facts, just that they love cloud and want everyone else to be in cloud, too. This happens much more often on sites like Reddit (r/sysadmin, even), but I wouldn't be surprised to see a little of it here.

It makes me wonder: how do people get so sold on a thing that they'll go online and fight about it, even when they lack facts or often even basic understanding?

I can clearly state why I advocate for avoiding cloud: cost, privacy, security, a desire to not centralize the Internet. The reason people advocate for cloud for others? It puzzles me. "You'll save money," "you can't secure your own machines," "it's simpler" all have worlds of assumptions that those people can't possibly know are correct.

So when I read something like this from Fastmail which was written without taking an emotional stance, I respect it. If I didn't already self-host email, I'd consider using Fastmail.

There used to be so much push for cloud everything that an article like this would get fanatical responses. I hope that it's a sign of progress that that fanaticism is waning and people aren't afraid to openly discuss how cloud isn't right for many things.

"All the pro-cloud talking points are just that - talking points that don't persuade anyone with any real technical understanding,"

This is false. AWS infrastructure is vastly more secure than almost all company data centers. AWS has a rule that the same person cannot have logical access and physical access to the same storage device. Very few companies have enough IT people to have this rule. The AWS KMS is vastly more secure than what almost all companies are doing. The AWS network is vastly better designed and operated than almost all corporate networks. AWS S3 is more reliable and scalable than anything almost any company could create on their own. To create something even close to it you would need to implement something like MinIO using 3 separate data centers.

> AWS infrastructure is vastly more secure than almost all company data centers

Secure in what terms? Security is always about a threat model and trade-offs. There's no absolute, objective term of "security".

> AWS has a rule that the same person cannot have logical access and physical access to the same storage device.

Any promises they make aren't worth anything unless there's contractually-stipulated damages that AWS should pay in case of breach, those damages actually corresponding to the costs of said breach for the customer, and a history of actually paying out said damages without shenanigans. They've already got a track record of lying on their status pages, so it doesn't bode well.

But I'm actually wondering what this specific rule even tries to defend against? You presumably care about data protection, so logical access is what matters. Physical access seems completely irrelevant no?

> Very few companies have enough IT people to have this rule

Maybe, but that doesn't actually mitigate anything from the company's perspective? The company itself would still be in the same position, aka not enough people to reliably separate responsibilities. Just that instead of those responsibilities being physical, they now happen inside the AWS console.

> The AWS KMS is vastly more secure than what almost all companies are doing.

See first point about security. Secure against what - what's the threat model you're trying to protect against by using KMS?

But I'm not necessarily denying that (at least some) AWS services are very good. Question is, is that "goodness" required for your use-case, is it enough to overcome its associated downsides, and is the overall cost worth it?

A pragmatic approach would be to evaluate every component on its merits and fitness to the problem at hand instead of going all in, one way or another.

Physical access is pretty relevant if you could bribe an engineer to locate some valuable data's physical location, then go service the particular machine, copy the disk (while servicing "degraded hardware"), and thus exfiltrate the data without any traces of a breach.

> They've already got a track record of lying on their status pages, so it doesn't bode well.

???

Physical access and logical root access can't hide things from each other. It takes both to hide an activity. If you only have one, the other can always be used to uncover or detect it in the first place, or at least to diagnose it afterwards.

OTOH:

1. big clouds are very lucrative targets for spooks, your data seem pretty likely to be hoovered up as "bycatch" (or maybe main catch depending on your luck) by various agencies and then traded around as currency

2. you never hear about security problems (incidents or exposure) in the platforms, there's no transparency

3. better than most corporate stuff is a low bar

>3. better than most corporate stuff is a low bar

I think it's a very relevant bar, though. The top level commenter made points about "a business of just about any size", which seems pretty exactly aligned with "most corporate stuff".

If you don't want your data to be accessible to "various agencies", don't share it with corporations, full stop. Corporations are obliged by law to make it available to the agencies, and the agencies often overreach, while the corporations almost never mind the overreach. There are limitations for stuff like health or financial data, but these are not impenetrable barriers.

I would just consider all your hosted data to be easily available to any security-related state agency; consider them already having a copy.

That depends where it's hosted and how it's encrypted. Cloud hosts can just reach into your RAM, but dedicated server hosts would need to provision that before deploying the server, and colocation providers would need to take your server offline to install it.

Colocated / Dedicated is not Cloud, AFAICT. It's the "traditional hosting", not elastic / auto-scalable. You of course may put your own, highly tamper-proof boxes in a colocation rack, and be reasonably certain that any attempt to exfiltrate data from them won't be invisible to you.

By doing so, you share nothing with your hosting provider, you only rent rack space / power / connectivity.

And this is why I colocate, because all the data that hits my server is my data.

Sure, I do have an AUP/T&C, but without a proper warrant no one is allowed to touch my server.

Case is monitored if it's opened. Encrypted on start-up, USB disabled. I just wished I had my own /24.

You can at least get your own /48, if you're under RIPE.

You should only do it if you expect to multihome though, or you're doing some experimentation that absolutely needs a PI address. Please don't pollute the default-free zone just for no reason.

There's much variation by jurisdiction. Eg US based big-cloud companies would seem more risky here if you're from a country with traditionally less invasive (and less funded) spooks.

4. we keep hitting hypervisor bugs and having to work around the fact that your software coexists on the same machine with untrusted 3rd-party software that might in fact be actively trying to attack you. All this silliness with encrypted memory buses and the various debilitating workarounds for silicon bugs.

So yes, the cloud is very secure, except for the very thing that makes it the cloud that is not secure at all and has just been papered over because questioning it means the business model is bust.

What hypervisor bugs are you referring to? AWS does offer bare metal servers.

Most corporations (which is the vast majority of cloud users) absolutely don't care about spooks, sadly enough. If that's the threat model, then it's a very very rare case to care about it. Most datacenters/corporations won't even fight or care about sharing data with local spooks/cops/three letter agencies. The actual threat is data leaks, security breaches, etc.

> you never hear about security probems (incidents or exposure) in the platforms

Except that one time...

https://www.seattlemet.com/news-and-city-life/2023/04/how-a-...

If I remember right, the attacker’s AWS employment is irrelevant - no privileged AWS access was used in that case. The attacker working for AWS was a pure coincidence, it could’ve been anyone.

[deleted]

one of my greatest learnings in life is to differentiate between facts and opinions - sometimes opinions are presented as facts and vice versa. if you think about it, the statement "this is false" is a response to an opinion (presented as a fact), but is not itself a fact. there is no way one can objectively define and defend what "real technical understanding" means. the cloud space is vast, with millions of people having varied understanding and thus varied opinions.

so let's not fight the battle that will never be won. there is no point in convincing pro-cloud people that cloud isn't the right choice and vice-versa. let people share stories where it made sense and where it didn't.

as someone who has lived in the cloud security space since 2009 (and was founder of redlock, one of the first CSPMs), in my opinion there is no doubt that AWS is indeed better designed than most corp. networks - but is that what you really need? if you run your entire corp and LOB apps on aws but have poor security practices, will it be the right decision? what if you have the best security engineers in the world, but they are best at Cisco-type security - configuring VLANs and managing endpoints - and not good at detecting someone using IMDSv1 on an ec2 instance exposed to the internet and running an app vulnerable to SSRF?

when the scope of discussion is as vast as cloud vs on-prem, imo, it is a bad idea to make absolute statements.

Great points. Also, if you end up building your apps as Rube Goldberg machines living up to "AWS Well-Architected" criteria (indoctrinated via lots of AWS certifications, leading to a lot of AWS-certified staff whose paychecks now depend on following AWS recommended practices), the complexity will kill your security, as nobody will understand the systems anymore.

about security, most businesses using AWS invest little to nothing in securing their software, or even adopt basic security practices for their employees

having the most secure data center doesn't matter if you load your secrets as env vars in a system that can be easily compromised by a motivated attacker

so i don't buy this argument as a general reason pro-cloud

This exactly, most leaks don't involve any physical access. Why bother with something hard when you can just get in through an unmaintained Wordpress/SharePoint/other legacy product that some department can't live without.

The cloud is someone else’s computer.

It’s like putting something in someone’s desk drawer under the guise of convenience at the expense of security.

Why?

Too often, someone other than the data owner has or can get access to the drawer directly or indirectly.

Also, Cloud vs self hosted to me is a pendulum that has swung back and forth for a number of reasons.

The benefits of the cloud outlined here are often a lot of open source tech packaged up and sold as manageable from a web browser, or a command line.

One of the major reasons the cloud became popular was networking issues in Linux to manage volume at scale. At the time the cloud became very attractive for that reason, plus being able to virtualize bare metal servers to put into any combination of local to cloud hosting.

Self-hosting has become easier by an order of magnitude or two for anyone who knew how to do it, but it's something only people who have done both self-hosting and cloud can really discuss.

Cloud has abstracted away the cost of horsepower, and converted it to transactions. People are discovering a fraction of the horsepower is needed to service their workloads than they thought.

At some point the horsepower got way beyond what they needed and it wasn’t noticed. But paying for a cloud is convenient and standardized.

Company data centres can be reasonably secured using a number of PaaS or IaaS solutions readily available off the shelf. Tools from VMware, Proxmox and others are tremendous.

It may seem like there's a lot to learn, but most problems that are new to someone have often already been thought through a ton, by people whose experience goes beyond cloud-only.

> The cloud is someone else’s computer.

And in the case of AWS it is someone else's extremely well designed and managed computer and network.

Extremely well designed? I doubt it.

Usually, the larger the company and the more mission-critical the product, the worse the implementation.

Twitch source code (which, I guess counts as Amazon already), Disney leaks- and my own experience working with very large companies. (Nokia, Ubisoft, Facebook, Activision/Blizzard).

Your comment tells me you have never read any of AWS's many documents about how they engineer their components. They put a huge amount of effort into it. AWS is much more reliable than Azure. They have built the largest and most reliable storage system in the world with S3. AWS has stated that some customers have S3 buckets using over 1 million hard drives. Netflix relies heavily on AWS for its streaming services. Lyft runs its ride-sharing platform on AWS. Capital One migrated its entire infrastructure to AWS. Slack relies on AWS for its messaging platform. GE utilizes AWS for industrial IoT (Internet of Things) solutions, predictive maintenance, and data analytics. Twitch streams video to 31 million viewers from AWS.

https://www.amazon.science/publications/cloud-resource-prote...

https://www.amazon.science/tag/formal-verification

https://aws.amazon.com/security/provable-security/resources/

https://www.amazon.science/blog/custom-policy-checks-help-de...

https://www.amazon.science/publications/formal-verification-...

AWS is an industry leader in using formal methods and automated reasoning to prove the security and reliability of critical software and detect insecure configurations

[deleted]

Generally I look to people who could build an AWS on the value of it or doing it themselves because they can do both.

Happy to hear more.

One of the ways the NSA and other security services get so much intelligence on targets isn't by direct decryption of what targets are storing, or by listening in. A great deal of their intelligence is simply metadata intelligence. They watch what you do. They watch the amount of data you transport. They watch your patterns of movement.

So even if AWS is providing direct security and encryption in the sense most security professionals are concerned with (key strength, etc.), AWS still has a great deal of information about what you do, because they get to watch how much data moves from where to where, and other information about what those machines are.

> The cloud is someone else’s computer

Isn’t it more like leasing in a public property? Meaning it is yours as long as you are paying the lease? Analogous to renting an apartment instead of owning a condo?

Not at all. You can inspect the apartment you rent. The cloud is totally opaque in that regard.

Totally opaque is a really nice way to describe it.

Nope. It's literally putting private data in a shared drawer in someone else's desk where you have your area of the drawer.

Literally?

I would just like to point out that most of us who have ever had a job at an office, attended an academic institution, or lived in rented accommodation have kept stuff in someone else’s desk drawer from time to time. Often a leased desk in a building rented from a random landlord.

Keeping things in someone else’s desk drawer can be convenient and offer a sufficient level of privacy for many purposes.

And your proposed alternative to using ‘someone else’s desk drawer’ is, what, make your own desk?

I guess, since I’m not a carpenter, I can buy a flatpack desk from ikea and assemble it and keep my stuff in that. I’m not sure that’s an improvement to my privacy posture in any meaningful sense though.

It doesn’t have to be entirely literal, or not literal at all.

A single point of managed/shared access to a drawer doesn’t fit all levels of data sensitivity and security.

I understand this kind of wording and analogy might be triggering for the drive-by downvoters.

A comment like the above though allows both people to openly consider viewpoints that may not be theirs.

For me it shed light on something simpler.

Shared access to shared infrastructure is not always secure as we want to tell ourselves. It’s important to be aware when it might be security through abstraction.

The dual security and convenience of self-hosting IaaS and PaaS even at a dev, staging or small scale production has improved dramatically, and allows for things to be built in a cloud agnostic way to allow switching clouds to be much easier. It can also easily build a business case to lower cloud costs. Still, it doesn’t have to be for everyone either, where the cloud turns to be everything.

A small example? For a stable homelab: a couple of USFF small servers running Proxmox or something, on residential fibre behind a Tailscale Funnel or Cloudflare Tunnel - then compare the cost for the uptime. It's surprising how much time servers and apps spend idling.

Life and the real world is more than binary. Be it all cloud or no cloud.

> Keeping things in someone else’s desk drawer can be convenient and offer a sufficient level of privacy for many purposes.

To torture the metaphor to death: are you going to keep your bank passwords in somebody else's desk drawer? Are you going to keep 100 million people's bank passwords in that drawer?

> I guess, since I’m not a carpenter, I can buy a flatpack desk from ikea and assemble it and keep my stuff in that. I’m not sure that’s an improvement to my privacy posture in any meaningful sense though.

If you're not a carpenter, I would recommend you stay out of the business of building safe desk drawers altogether. Although you should probably still be able to recognize that the desk drawer you own, inside your own locked house, is a safer option than the one at the office accessible by any number of people.

If you have something physical of equivalent value to 100 million people's bank passwords, you may well not want to risk keeping it in a desk drawer at all, and instead want to look into renting a nice secure drawer from someone else to keep it in. That would be a safety deposit box.

Which I would argue is rather more like what cloud providers offer than 'someone else's desk drawer' is.

AWS is so complicated that we usually find more impactful permission problems there than in any company using their own hardware.

The other part is that when us-east-1 goes down, you can blame AWS, and a third of your customer's vendors will be doing the same. When you unplug the power to your colo rack while installing a new server, that's on you.

It's not always a full availability zone going down that is the problem. Also, despite the "no one ever got fired for buying Microsoft" logic, in practice I've never actually found stakeholders to be reassured by "it's AWS and everyone is affected" when things are down. People want things back up, and they want some informed answers about when that might happen, not "ehh, it's AWS, out of our control".

When there's little trust between the business and IT, both are incentivized to move to the cloud.

It's harder to build trust than the opposite.

OTOH, when your company's web site is down, you can do something about it. When the CEO asks about it, you can explain why it's offline and, more importantly, what is being done to bring it back.

The equivalent situation for those who took a cloud based approach is often... ¯\_(ツ)_/¯

The more relevant question is whether my efforts to do something lead to a better and faster result than my cloud provider's efforts to do something. I get it - it feels powerless to do nothing, but for a lot of organizations I've seen, the average downtime would still be higher.

I worked in IT for a state government and they had a partial outage of their Exchange server that lasted over 2 weeks. It triggered a full migration to Exchange online.

When AWS goes down you can tell your boss that dozens of people are working to get it back up.

With the cloud, in a lot of cases you can have additional regions that incur very little cost as they scale dynamically with traffic. It’s hard to do that with on-prem. Also many AWS services come cross-AZ (AZ is a data center), so their arch is more robust than a single Colo server even if you’re in a single region.

Cross region from on-prem to the cloud for a website is easy. In fact, as long as you don't buy into "cloud native" ("cloud lock-in"?), it's probably more cost effective than two on-prem regions or two cloud regions.

Hey boss, I go to sleep now, site should be up anytime. Cheers

Making API calls from a VM on shared hardware to KMS is vastly more secure than doing AES locally? I'm skeptical to say the least.

Encrypting data is easy, securely managing keys is the hard part. KMS is the Key Management Service. And AWS put a lot of thought and work into it.

https://docs.aws.amazon.com/kms/latest/cryptographic-details...

KMS access is granted by either environment variables or by authorizing the instance itself. Either way, if the instance is compromised, then so is access to KMS. So unless your threat model involves preventing the government from looking at your data through some theoretical sophisticated physical attack, then your primary concerns are likely the same as running a box in another physically secure location. So the same rules of needing to design your encryption scheme to minimize blowout from a complete hostile takeover still apply.

An attacker gaining temporary capability to encrypt/decrypt data through a compromised instance is painful. An attacker gaining a copy of a private key is still an entirely different world of pain.

Painful is an understatement. Keys for sensitive customer data should be derived from customer secrets either way. Almost nobody does that though, because it requires actual forethought. Instead they just slap secrets in KMS and pretend it's better than encrypted environment variables or other secrets services. If an attacker can read your secrets with the same level of penetration into your system, then it's all the same security wise.
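A minimal sketch of that "derive keys from customer secrets" idea, using only Python's standard library. Names and parameters here are illustrative, not any real provider's API:

```python
import hashlib
import hmac
import os

# Hypothetical sketch: derive a per-customer encryption key from a
# customer-held secret, so a compromised server alone can't decrypt
# data at rest. Parameters are illustrative, not a production spec.

def derive_key(customer_secret: str, salt: bytes, iterations: int = 600_000) -> bytes:
    """Derive a 32-byte key from the customer's secret via PBKDF2-HMAC-SHA256."""
    return hashlib.pbkdf2_hmac("sha256", customer_secret.encode(), salt, iterations)

salt = os.urandom(16)  # stored alongside the ciphertext; need not be secret
key = derive_key("correct horse battery staple", salt)

assert len(key) == 32
# Same secret + same salt -> same key, so the server never stores the secret,
# only the salt and the ciphertext.
assert hmac.compare_digest(key, derive_key("correct horse battery staple", salt))
```

The trade-off, as the surrounding comments note, is that secrets a customer doesn't hold (TLS private keys, etc.) can't be protected this way and still need separate management.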

There are many kinds of secrets that are used for purposes where they cannot be derived from customer secrets, and those still need to be secured. TLS private keys for example.

I do disagree on the second part - there’s a world of a difference whether an attacker obtains a copy of your certificates private key and can impersonate you quietly or whether they gain the capability to perform signing operations on your behalf temporarily while they maintain access to a compromised instance.

It's all unencrypted secrets from perspective of an attacker. If they somehow already have enough access to read your environment variables, then they can definitely access secrets manager records authorized for that service. By all means put secrets management in a secondary service to prevent leaking keys, but you don't need a cloud service to do that.

It's the same pain, since the resolution is the exact same. You have to rotate.

It's now been two years since I used KMS, but at the time it seemed little more than an S3-style API with Twitter-sized limits.

Fundamentally why would KMS be more secure than S3 anyway? Both ultimately have the same fundamental security requirements and do the same thing.

So the big whirlydoo is that KMS has hardware keygen. I'm sorry, but that sounds like something almost guaranteed to have an NSA backdoor, or to have had so much NSA attention that it has been compromised.

If your threat model is the NSA and you’re worried about backdoors then don’t use any cloud provider?

Maybe I’m just jaded from years doing this, but two things have never failed me for bringing me peace of mind in the infrastructure/ops world:

1. Use whatever your company has already committed to. Compare options and bring up tradeoffs when committing to a cloud-specific service(ie. AWS Lambdas) versus more generic solutions around cost, security and maintenance.

2. Use whatever feels right to you for anything else.

Preventing the NSA from cracking into your system is a fun thought exercise, but life is too short to make that the focus of all your hosting concerns

I guess since this is Hacker News, I shouldn’t be surprised that there are a bunch of commenters who are absolutely certain they and their random colo provider will do a better job of defeating the almighty NSA than AWS.

You won’t even know when they serve your Colo provider with a warrant under gag order, and I’m certain they’ll be able to bypass your own “tamper-proof” protections.

If the NSA is part of your threat model, then good luck. I'm not sure any single company could withstand the NSA really trying to hack them for years. The threat of possible NSA backdoors is not a reasonable argument against a cloud provider, as the NSA could also have backdoors in every CPU that AMD, Intel, and AWS make.

You can securely store your asymmetric signing key, but if I remember correctly the logs are pretty useless: basically you just know the key was used to make a signature, with no option to log the signature itself or additional metadata, which would help auditing after an account/app compromise.

Taking for granted all these points. How many businesses out there actually need this kind of security/scalability, compared to how many use cloud services and pay extra cost for something they don't need?

From a critical perspective, your comment made me think about the risks posed by rogue IT personnel, especially at scale in the cloud. For example, Fastmail is a single point of failure as a DoS target, whereas attacking an entire datacenter can impact multiple clients simultaneously. It all comes down to understanding the attack vectors.

Cloud providers are very big targets but have enormous economic incentive to be secure and thus have very large teams of very competent security experts.

You can have full security competence but be a rogue actor at the same time.

You can also have rogue actors in your company, you don’t need 3rd parties for that

And I bet AWS is better at detecting them.

That doesn't sum up my comments in the thread. A rogue actor in a datacenter could attack zillions of companies at the same time, while a rogue actor inside a single company can attack only that one.

And I bet AWS is also better at detecting rogue actors.

I don't understand what this is trying to say.

<citations needed>

AWS hires the same cretins that inhabit every other IT department, they just usually happen to be more technically capable. That doesn't make them any more or less trustworthy or reliable.

"cretins"?

This trivializes some real issues.

The biggest problem the cloud solves is hardware supply chain management. To realize the full benefits of doing your own build at any kind of non-trivial scale you will need to become an expert in designing, sourcing, and assembling your hardware. Getting hardware delivered when and where you need it is not entirely trivial -- components are delayed, bigger customers are given priority allocation, etc. The technical parts are relatively straightforward; managing hardware vendors, logistics, and delivery dates on an ongoing basis is a giant time suck. When you use the cloud, you are outsourcing this part of the work.

If you do this well and correctly then yes, you will reduce costs several-fold. But most people that build their own data infrastructure do a half-ass job of it because they (understandably) don't want to be bothered with any of these details and much of the nominal cost savings evaporate.

Very few companies do security as well as the major cloud vendors. This isn't even arguable.

On the other hand, you will need roughly the same number of people for operations support whether it is private data infrastructure or the cloud, there is little or no savings to be had here. The fixed operations people overhead scales to such a huge number of servers that it is inconsequential as a practical matter.

It also depends on your workload. The types of workloads that benefit most from private data infrastructure are large-scale, data-intensive workloads. If your day-to-day is slinging tens or hundreds of PB of data for analytics, the economics of private data infrastructure are extremely compelling.

> managing hardware vendors, logistics, and delivery dates on an ongoing basis is a giant time suck

You can rent servers and it's still not cloud.

I'm pretty neutral and definitely see the value of cloud. But a lot of cloud proponents seem to lack, what to me, seems like basic knowledge.

> don't want to be bothered with any of these details

Isn't the job to be bothered with the details? 90% of employment for most people is doing shit you don't really want to be doing, but that's the job.

My firm belief after building a service at scale (tens of millions of end users, > 100K tps) is that AWS is unbeatable. We don’t even think about building our own infrastructure. There’s no way we could ever make it reliable enough, secure enough, and future-proof enough to ever pay back the cost difference.

Something people neglect to mention when they tout their home grown cloud is that AWS spends significant cycles constantly eliminating technical debt that would absolutely destroy most companies - even ones with billion dollar services of their own. The things you rely on are constantly evolving and changing. It’s hard enough to keep up at the high level of a SaaS built on top of someone else’s bulletproof cloud. But imagine also having to keep up with the low level stuff like networking and storage tech?

No thanks.

I've done it. It's nowhere near as complicated as you make it seem. It definitely doesn't kill - no more than failing to manage your software tech debt does. In fact, the latter is both harder to keep up with and more risky, because it changes faster than the low-level stuff to support business needs.

With the cloud you have IT/DevOps deal only with scaling the software components of the infra. When doing on-prem they take on the physical layer as well. Do you have enough trust in them to scale the physical part where needed?

...and power, backup power, HVAC, physical security...

Or buy colo space and they do it for you. It's not all cloud vs. owning a datacenter - there's a thousand shades of grey.

<ctoHatTime> Dunno man, it's really really easy to set up an S3 bucket and use it to share datasets with users authorized via IAM....

And IAM and other cloud security and management considerations are where the opex/capex and capability argument can start to break down. Turns out, the "cloud" savings come from not having the capabilities in house to manage hardware. Sometimes, for most businesses, you want some of that lovely reliability.

(In short, I agree with you, substantially).

Like code. It is easy to get something basic up, but substantially more resources are needed for non-trivial things.

I feel like IAM may be the sleeper killer-app of cloud.

I self-host a lot of things, but boy oh boy if I were running a company it would be a helluvalotta work to get IAM properly set up.

I strongly agree with this and also strongly lament it.

I find IAM to be a terrible implementation of a foundationally necessary system. It feels tacked on to me, except now it's tacked onto thousands of other things and there's no way out.

like terraform! isn't pulumi 100% better but there's no way out of terraform.

That's essentially why "platform engineering" is a hot topic. There are great FOSS tools for this, largely in the Kubernetes ecosystem.

To be clear, authentication could still be outsourced, but authorizing access to (on-prem) resources in a multi-tenant environment is something that "platforms" are frequently designed for.

> All the pro-cloud talking points... don't persuade anyone with any real technical understanding

This is a very engineer-centric take. The cloud has some big advantages that are entirely non-technical:

- You don't need to pay for hardware upfront. This is critical for many early-stage startups, who have no real ability to predict CapEx until they find product/market fit.

- You have someone else to point the SOC2/HIPAA/etc auditors at. For anyone launching a company in a regulated space, being able to checkbox your entire infrastructure based on AWS/Azure/etc existing certifications is huge.

You can over-provision your own bare-metal resources 20x and it will still be cheaper than cloud. The capex talking point is just that, a talking point.

As an early-stage startup?

Your spend in the first year on AWS is going to be very close to zero for something like a SaaS shop.

Nor can you possibly scale in-house baremetal fast enough if you hit the fabled hockey stick growth. By the time you sign a colocation contract and order hardware, your day in the sun may be over.

> You have someone else to point the SOC2/HIPAA/etc auditors at.

I would assume you still need to point auditors to your software in any case

You do, which makes it very nice to not have to answer questions about the physical security of your servers.

Cloud expands the capabilities of what one team can manage by themselves, enabling them to avoid a huge amount of internal politics.

This is worth astronomical amounts of money in big corps.

I’m not convinced this is entirely true. The upfront cost if you don’t have the skills, sure – it takes time to learn Linux administration, not to mention management tooling like Ansible, Puppet, etc.

But once those are set up, how is it different? AWS is quite clear with their responsibility model that you still have to tune your DB, for example. And for the setup, just as there are Terraform modules to do everything under the sun, there are Ansible (or Chef, or Salt…) playbooks to do the same. For both, you _should_ know what all of the options are doing.

The only way I see this sentiment being true is that a dev team, with no infrastructure experience, can more easily spin up a lot of infra – likely in a sub-optimal fashion – to run their application. When it inevitably breaks, they can then throw money at the problem via vertical scaling, rather than addressing the root cause.

I think this is only true for teams and apps of a certain size.

I've worked on plenty of teams with relatively small apps, and the difference between:

1. Cloud: "open up the cloud console and start a VM"

2. Owned hardware: "price out a server, order it, find a suitable datacenter, sign a contract, get it racked, etc."

Is quite large.

#1 is 15 minutes for a single team lead.

#2 requires the team to agree on hardware specs, get management approval, finance approval, executives signing contracts. And through all this you don't have anything online yet for... weeks?

If your team or your app is large, this probably all averages out in favor of #2. But small teams often don't have the bandwidth or the budget.

I work for a 50 person subsidiary of a 30k person organisation. I needed a domain name. I put in the purchase request and 6 months later eventually gave up, bought it myself and expensed it.

Our AWS account is managed by an SRE team. It’s a 3 day turnaround process to get any resources provisioned, and if you don’t get the exact spec right (you forgot to specify the iops on the volume? Oops) 3 day turnaround. Already started work when you request an adjustment? Better hope as part of your initial request you specified backups correctly or you’re starting again.

The overhead is absolutely enormous, and I actually don’t even have billing access to the AWS account that I’m responsible for.

> 3 day turnaround process to get any resources provisioned

Now imagine having to deal with procurement to purchase hardware for your needs. 6 months later you have a server. Oh you need a SAN for object storage? There goes another 6 months.

At a previous job we had some decent on-prem resources for internal services. The SRE guys had a bunch of extra compute and you would put in a ticket for a certain amount of resources (2 CPU, SSD, 8GB memory x2 on different hosts). There wasn't a massive amount of variability between the hardware, and you just requested resources to be allocated from a bunch of hypervisors. Turnaround time was about 3 days too. Except, you weren't required to be self-sufficient in AWS terminology to request exactly what you needed.

> Our AWS account is managed by an SRE team.

That's an anti-pattern (we call it "the account") in the AWS architecture.

AWS internally just uses multiple accounts, so a team can get their own account with centrally-enforced guardrails. It also greatly simplifies billing.

That’s not something that I have control over or influence over.

Manageability of cloud without a dedicated resource is a form of resource creep, and shadow labour costs that aren’t factored in.

How many things don't end up happening because of this, when they only need a sliver of resources at the start?

You're assuming that hosting something in-house implies that each application gets its own physical server.

You buy a couple of beastly things with dozens of cores. You can buy twice as much capacity as you actually use and still be well under the cost of cloud VMs. Then it's still VMs and adding one is just as fast. When the load gets above 80% someone goes through the running VMs and decides if it's time to do some house cleaning or it's time to buy another host, but no one is ever waiting on approval because you can use the reserve capacity immediately while sorting it out.
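The over-provisioning arithmetic is easy to sanity-check. A quick sketch (all prices, specs, and ratios here are made-up illustrative numbers, not quotes from any vendor):

```python
# Back-of-the-envelope comparison of owned hosts vs cloud VMs.
# Every number below is an illustrative assumption, not a real quote.

def owned_monthly(server_price, amortize_months, colo_per_month, servers):
    """Amortized monthly cost of owning `servers` machines."""
    return servers * (server_price / amortize_months + colo_per_month)

def cloud_monthly(vm_price, vms):
    """Monthly cost of the equivalent capacity rented as cloud VMs."""
    return vms * vm_price

# Assume one beastly 64-core host replaces ~16 four-core cloud VMs,
# and that we buy 4 hosts while only really needing 2 (2x headroom).
own = owned_monthly(server_price=12_000, amortize_months=36,
                    colo_per_month=150, servers=4)
cloud = cloud_monthly(vm_price=120, vms=32)  # only the capacity in use

print(f"owned: ${own:,.0f}/mo  cloud: ${cloud:,.0f}/mo")
```

Even with the 2x headroom already paid for, the owned hosts come out ahead under these assumptions; swap in your own quotes to check your situation.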

The SMB I work for runs a small on-premise data center that is shared between teams and projects, with maybe 3-4 FTEs managing it (the respective employees also do dev and other work). This includes self-hosting email, storage, databases, authentication, source control, CI, ticketing, company wiki, chat, and other services. The current infrastructure didn’t start out that way and developed over many years, so it’s not necessarily something a small startup can start out with, but beyond a certain company size (a couple dozen employees or more) it shouldn’t really be a problem to develop that, if management shares the philosophy. I certainly find it preferable culturally, if not technically, to maximize independence in that way, have the local expertise and much better control over everything.

One (the only?) indisputable benefit of cloud is the ability to scale up faster (elasticity), but most companies don’t really need that. And if you do end up needing it after all, then it’s a good problem to have, as they say.

Your last paragraph identifies the reason that running their own hardware makes sense for Fastmail. The demand for email is pretty constant. Everyone does roughly the same amount of emailing every day. Daily load is predictable, and growth is predictable.

If your load is very spiky, it might make more sense to use cloud. You pay more for the baseline, but if your spikes are big enough it can still be cheaper than provisioning your own hardware to handle the highest loads.

Of course there's also possibly a hybrid approach, you run your own hardware for base load and augment with cloud for spikes. But that's more complicated.

I’ve never worked at a company with these particular problems, but:

#1: A cloud VM comes with an obligation for someone at the company to maintain it. The cloud does not excuse anyone from doing this.

#2: Sounds like a dysfunctional system. Sure, it may be common, but a medium sized org could easily have some datacenter space and allow any team to rent a server or an instance, or to buy a server and pay some nominal price for the IT team to keep it working. This isn’t actually rocket science.

Sure, keeping a fifteen year old server working safely is a chore, but so is maintaining a fifteen-year-old VM instance!

The cloud is someone else’s computer.

Renting from a VM provider, or installing a hypervisor on your own equipment, is another thing.

Obligation? Far from it. I've worked at some poorly staffed companies. Nobody is maintaining old VMs or container images. If it works, nobody touches it.

I worked at a supposedly properly staffed company that had raised 100's of millions in investment, and it was the same thing. VMs running 5 year old distros that hadn't been updated in years. 600 day uptimes, no kernel patches, ancient versions of Postgres, Python 2.7 code everywhere, etc. This wasn't 10 years ago. This was 2 years ago!

There is a large gap between "own the hardware" and "use cloud hosting". Many people rent the hardware, for example, and you can use managed databases, which is one step up from "starting a VM".

But your comparison isn't fair. The difference between running your own hardware and using the cloud (which is perhaps not even the relevant comparison but let's run with it) is the difference between:

1. Open up the cloud console, and

2. You already have the hardware so you just run "virsh" or, more likely, do nothing at all because you own the API so you have already included this in your Ansible or Salt or whatever you use for setting up a server.

Because ordering a new physical box isn't really comparable to starting a new VM, is it?

I've always liked the theory of #2, I just haven't worked anywhere yet that has executed it well.

Before the cloud, you could get a VM provisioned (virtual servers) or a couple of apps set up (LAMP stack on a shared host ;)) in a few minutes over a web interface already.

"Cloud" has changed that by providing an API to do this, thus enabling IaC approach to building combined hardware and software architectures.

3. "Dedicated server" at any hosting provider

Open their management console, press order now, 15 mins later get your server's IP address.

For purposes of this discussion, isn't AWS just a very large hosting provider?

I.e. most hosting providers give you the option for virtual or dedicated hardware. So does Amazon (metal instances).

Like, "cloud" was always an ill-defined term, but in the case of "how do I provision full servers" I think there's no qualitative difference between Amazon and other hosting providers. Quantitative, sure.

> Amazon (metal instances)

But you still get nickel & dimed and pay insane costs, including on bandwidth (which is free in most conventional hosting providers, and overages are 90x cheaper than AWS' costs).

Qualitatively, AWS is greedy and nickel-and-dimes you to death. Their Route53 service doesn't even have all the standard DNS options I need, which I can get everywhere else, or even on my own running bind9. I do not use IPv6 for several reasons, so when AWS decided to charge for IPv4, I went looking elsewhere to get my VMs.

I can't even imagine how much the US Federal Government is charging American taxpayers to pay AWS for hosting there, it has to be astronomical.

Out of curiosity, which DNS record types do you need that Route53 doesn't support?

More like 15 seconds.

You have omitted the option between the two, which is renting a server. No hardware to purchase, maintain or set up. Easily available in 15 minutes.

While I did say "VM" in my original comment, to me this counts as "cloud" because the UI is functionally the same.

You gave me flashbacks to a far worse bureaucratic nightmare with #2 in my last job.

I supported an application with a team of about three people for a regional headquarters in the DoD. We had one stack of aging hardware that was racked, on a handshake agreement with another team, in a nearby facility under that other team's control. We had to periodically request physical access for maintenance tasks and the facility routinely lost power, suffered local network outages, etc. So we decided that we needed new hardware and more of it spread across the region to avoid the shaky single-point-of-failure.

That began a three year process of: waiting for budget to be available for the hardware / license / support purchases; pitching PowerPoints to senior management to argue for that budget (and getting updated quotes every time from the vendors); working out agreements with other teams at new facilities to rack the hardware; traveling to those sites to install stuff; and working through the cybersecurity compliance stuff for each site. I left before everything was finished, so I don't know how they ultimately dealt with needing, say, someone to physically reseat a cable in Japan (an international flight away).

There is a middle ground between the extremes of those pendulum swings of all cloud or physical metal.

You can start with using a cloud only for VMs and only run services on it using IaaS or PaaS. Very serviceable.

You can get pretty far without any of that fancy stuff. You can get plenty done by using parallel-ssh and then focusing on the actual thing you develop instead of endless tooling and docker and terraform and kubernetes and salt and puppet and ansible. Sure, if you know why you need them and know what value you get from them OK. But many people just do it because it's the thing to do...
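For what it's worth, the parallel-ssh idea fits in a few lines of plain Python if you'd rather not add another tool. A sketch, with hypothetical hostnames:

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor

def ssh_cmd(host, command, user="root"):
    # BatchMode avoids hanging on password prompts for unreachable hosts.
    return ["ssh", "-o", "BatchMode=yes", f"{user}@{host}", command]

def run_everywhere(hosts, command, workers=8):
    """Run one shell command on every host; return {host: exit_code}."""
    def run(host):
        return subprocess.run(ssh_cmd(host, command),
                              capture_output=True, text=True).returncode
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(zip(hosts, pool.map(run, hosts)))

# e.g. run_everywhere(["web1", "web2", "db1"], "uptime")
```

Same idea as parallel-ssh/pssh, just inlined so you can see there's no magic in it.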

Do you need those tools? It seems that for fundamental web hosting, you need your application server, nginx or similar, postgres or similar, and a CLI. (And an interpreter etc if your application is in an interpreted lang)

I suppose that depends on your RTO. With cloud providers, even on a bare VM, you can to some extent get away with having no IaC, since your data (and therefore config) is almost certainly on networked storage which is redundant by design. If an EC2 instance fails, or even if one of the drives backing your EBS volume fails, it'll probably come back up as it was.

If it's your own hardware, if you don't have IaC of some kind – even something as crude as a shell script – then a failure may well mean you need to manually set everything up again.

All EBS volumes except io2 have advertised durability of 99.8%, which is pretty low, so don't count it in the magic networked storage category.

Get two servers (or three, etc)?

Well, sure – I was trying to do a comparison in favor of cloud, because the fact that EBS Volumes can magically detach and attach is admittedly a neat trick. You can of course accomplish the same (to a certain scale) with distributed storage systems like Ceph, Longhorn, etc. but then you have to have multiple servers, and if you have multiple servers, you probably also have your application load balanced with failover.

For fundamentals, that list is missing:

- Some sort of firewall or network access control. Being able to say "allow http/s from the world (optionally minus some abuser IPs that cause problems), and allow SSH from developers (by IP, key, or both)" at a separate layer from nginx is prudent. Can be iptables/nftables config on servers or a separate firewall appliance.

- Some mechanism of managing storage persistence for the database, e.g. backups, RAID, data files stored on fast network-attached storage, db-level replication. Not losing all user data if you lose the DB server is table stakes.

- Something watching external logging or telemetry to let administrators know when errors (e.g. server failures, overload events, spikes in 500s returned) occur. This could be as simple as Pingdom or as involved as automated alerting based on load balancer metrics. Relying on users to report downtime events is not a good approach.

- Some sort of CDN, for applications with a frontend component. This isn't required for fundamental web hosting, but for sites with a frontend and even moderate (10s/sec) hit rates, it can become required for cost/performance; CDNs help with egress congestion (and fees, if you're paying for metered bandwidth).

- Some means of replacing infrastructure from nothing. If the server catches fire or the hosting provider nukes it, having a way to get back to where you were is important. Written procedures are fine if you can handle long downtime while replacing things, but even for a handful of application components those procedures get pretty lengthy, so you start wishing for automation.

- Some mechanism for deploying new code, replacing infrastructure, or migrating data. Again, written procedures are OK, but start to become unwieldy very early on ('stop app, stop postgres, upgrade the postgres version, start postgres, then apply application migrations to ensure compatibility with new version of postgres, then start app--oops, forgot to take a postgres backup/forgot that upgrading postgres would break the replication stream, gotta write that down for next time...').

...and that's just for a very, very basic web hosting application--one that doesn't need caches, blob stores, the ability to quickly scale out application server or database capacity.

Each of those things can be accomplished the traditional way--and you're right, that sometimes that way is easier for a given item in the list (especially if your maintainers have expertise in that item)! But in aggregate, having a cloud provider handle each of those concerns tends to be easier overall and not require nearly as much in-house expertise.
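As a concrete example of how small the monitoring item in that list can start out: a cron-driven healthcheck in stdlib Python. The `/healthz` URL is a hypothetical stand-in for a real endpoint, and real alerting (mail, pager) would replace the print:

```python
import urllib.request
import urllib.error

def is_healthy(status):
    # Treat any 2xx as healthy; None means the request never completed.
    return status is not None and 200 <= status < 300

def check(url, timeout=5):
    """Return the HTTP status of `url`, or None on connection failure."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as e:
        return e.code   # server answered, but with an error status
    except OSError:
        return None     # DNS failure, connection refused, timeout

if __name__ == "__main__":
    # Run from cron every minute or two, from a box outside your network.
    status = check("https://example.com/healthz")  # hypothetical endpoint
    if not is_healthy(status):
        print(f"ALERT: healthcheck returned {status}")
```

Obviously this grows into real telemetry eventually, but it covers "don't rely on users to report downtime" on day one.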

I have never ever worked somewhere with one of these "cloud-like but custom on our own infrastructure" setups that didn't leak infrastructure concerns through the abstraction, to a significantly larger degree than AWS.

I believe it can work, so maybe there are really successful implementations of this out there, I just haven't seen it myself yet!

You are focusing on technology. And sure of course you can get most of the benefits of AWS a lot cheaper when self-hosting.

But when you start factoring internal processes and incompetent IT departments, suddenly that's not actually a viable option in many real-world scenarios.

Exactly. With the cloud you can suddenly do all the things your tyrannical Windows IT admin has been saying are impossible for the last 30 years.

It is similar to cooking at home vs ordering cooked food every day. If someone guarantees the taste and quality, people would be happy to outsource it.

All of that is... completely unrelated to the GP's post.

Did you reply to the right comment? Do you think "politics" is something you solve with Ansible?

> Cloud expands the capabilities of what one team can manage by themselves, enabling them to avoid a huge amount of internal politics.

It's related to the first part. Re: the second, IME if you let dev teams run wild with "managing their own infra," the org as a whole eventually pays for that when the dozen bespoke stacks all hit various bottlenecks, and no one actually understands how they work, or how to troubleshoot them.

I keep being told that "reducing friction" and "increasing velocity" are good things; I vehemently disagree. It might be good for short-term profits, but it is poison for long-term success.

> I keep being told that "reducing friction" and "increasing velocity" are good things

As always, good rules are good, and bad rules are bad.

Like most people on the internet, you are assuming only one of those sets exists. But you are just assuming a different set from everybody that you are criticizing.

Our big company locked all cloud resources behind a floating/company-wide DevOps team (git and CI too). We have an old on-prem server that we jealously guard because it allows us to create remotes for new git repos and deploy prototypes without consulting anyone.

(To be fair, I can see why they did it - a lot of deployments were an absolute mess before.)

This is absolutely spot on.

What do you mean, I can't scale up because I've used my hardware capex budget for the year?

I have said for years that the value of cloud is mainly its API; that's the selling point in large enterprise.

Self-hosted software also has APIs, and Terraform libraries, and Ansible playbooks, etc. It’s just that you have to know what it is you’re trying to do, instead of asking AWS what collection of XaaS you should use.

But isn't using Fastmail akin to using a cloud provider (managed email vs managed everything else)? They are similarly a service provider, and as a customer, you don't really care "who their ISP is?"

The discussion matters when we are talking about building things: whether you self-host or use managed services is a set of interesting trade-offs.

Yes, FastMail is a SaaS. But there are adepts of a religion who would tell you that companies like FastMail should be built on top of AWS and that it is the only true way. It is good to have some counter-narrative to this.

Being cloud compatible (packaged well) can be as important as being cloud-agnostic (work on any cloud).

Too many projects become beholden to one cloud.

Even as an anti-cloud (or more accurately anti-everything-cloud) person I still think there are many benefits to cloud. Just most of them are oversold and people don't need them.

Number one is company bureaucracy and politics. No one wants to beg another person or department, or sit through endless meetings, just to have extra hardware provisioned. For engineers that alone is worth perhaps 99% of all current cloud margins.

Number two is also company bureaucracy and politics. CFOs don't like CapEx. Turning it into OpEx makes things easier for them. Along with end-of-year company budget turning into cloud credits for different departments. Especially for companies with government funding.

Number three is really company bureaucracy and politics. Dealing with just Google, AWS or Microsoft means you no longer have to deal with dozens of different vendors for servers, networking hardware, software licenses, etc. Instead it is all pre-approved into AWS, GCP or Azure. This is especially useful for things that involve government contracts or funding.

There are also things like instant worldwide deployment. You can have things up and running in any region within seconds. That's useful when you have a site that gets 10 to 1000x the normal traffic from time to time.

But then a lot of small businesses don't have these sorts of issues. Especially non-consumer-facing services. Business or SaaS products are highly unlikely to get 10x more customers within a short period of time.

I continue to wish there were a middle ground somewhere. You rent a dedicated server for cheap as base load and use cloud for everything else.

The fact is, managing your own hardware is a PITA and a distraction from focusing on the core product. I loathe messing with servers and even opt for "overpriced" PaaS like Fly, Render, Vercel. Because every minute spent messing with and monitoring servers is time not spent on product. My tune might change past a certain size, when there's a massive cloud bill and room for full-time ops people, but to offset their salaries, it would have to be huge.

That argument makes sense for PaaS services like the ones you mention. But for bare "cloud" like AWS, I'm not convinced it is saving any effort, it's merely swapping one kind of complexity with another. Every place I've been in had full-time people messing with YAML files or doing "something" with the infrastructure - generally trying to work around the (self-inflicted) problems introduced by their cloud provider - whether it's the fact you get 2010s-era hardware or that you get nickel & dimed on absolutely arbitrary actions that have no relationship to real-world costs.

In what sense is AWS "bare cloud"? S3, DynamoDB, Lambda, ECS?

How do you configure S3 access control? You need to learn & understand how their IAM works.

How do you even point a pretty URL to a lambda? Last time I looked you need to stick an "API gateway" in front (which I'm sure you also get nickel & dimed for).

How do you go from "here's my git repo, deploy this on Fargate" with AWS? You need a CI pipeline which will run a bunch of awscli commands.

And I'm not even talking about VPCs, security groups, etc.

Somewhat different skillsets than old-school sysadmin (although once you know sysadmin basics, you realize a lot of these are just the same concepts under a branded name with arbitrary nickel-and-diming sprinkled on top), but equivalent in complexity.

How does one install and run Linux/BSD/another UNIX? One needs to learn and understand how a UNIX works.

The essence of the complaint that one has to have the knowledge of something before that something can be used. It seems like a reasonable expectation for just about anything in life.

(API Gateway in AWS is USD 2.35 for 10 million 32 kB requests, a Lambda can have its own function URL if required, and Fargate does not deploy Git repos; it runs Docker images.)

EC2

I would actually argue that EC2 is a "cloud smell"--if you're using EC2 you're doing it wrong.

Counterpoint: if you’re never “messing with servers,” you probably don’t have a great understanding of how their metrics map to those of your application’s, and so if you bottleneck on something, it can be difficult to figure out what to fix. The result is usually that you just pay more money to vertically scale.

To be fair, you did say “my tune might change past a certain size.” At small scale, nothing you do within reason really matters. World’s worst schema, but your DB is only seeing 100 QPS? Yeah, it doesn’t care.

I don’t think you’re correct. I’ve watched junior/mid-level engineers figure things out solely by working on the cloud and scaling things to a dramatic degree. It’s really not rocket science.

I didn't say it's rocket science, nor that it's impossible to do without having practical server experience, only that it's more difficult.

Take disks, for example. Most cloud-native devs I've worked with have no clue what IOPS are. If you saturate your disk, that's likely to cause knock-on effects like increased CPU utilization from IOWAIT, and since "CPU is high" is pretty easy to understand for anyone, the seemingly obvious solution is to get a bigger instance, which depending on the application, may inadvertently solve the problem. For RDBMS, a larger instance means a bigger buffer pool / shared buffers, which means fewer disk reads. Problem solved, even though actually solving the root cause would've cost 1/10th or less the cost of bumping up the entire instance.
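To make the disk point concrete, here's a minimal sketch of computing IOPS the way tools like iostat do: from two samples of the same device's /proc/diskstats line (field positions per the kernel's iostats documentation; the sample values below are synthetic):

```python
def completed_ops(diskstats_line):
    """Reads + writes completed from one /proc/diskstats line.
    After major, minor, and device name, field 1 is reads completed
    and field 5 is writes completed (kernel iostats layout)."""
    f = diskstats_line.split()
    return int(f[3]) + int(f[7])

def iops(sample_before, sample_after, interval_s):
    """Average IOPS between two samples of the same device line."""
    return (completed_ops(sample_after) - completed_ops(sample_before)) / interval_s

# Two samples of the sda line, ten seconds apart (synthetic values):
before = "8 0 sda 1000 0 0 0 500 0 0 0 0 0 0"
after  = "8 0 sda 1400 0 0 0 600 0 0 0 0 0 0"
print(iops(before, after, 10))   # (400 reads + 100 writes) / 10 s = 50.0
```

In practice you'd read /proc/diskstats twice with a sleep in between; the point is just that "is the disk saturated?" is answerable with a few lines and no vendor dashboard.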

> Most cloud-native devs

You might be making some generalizations from your personal experience. Since 2015, at all of my jobs, everything has been running on some sort of a cloud. I've yet to meet a person who doesn't understand IOPS. If I were a junior (and from my experience, that's what they tend to do), I'd just google "slow X potential reasons". You'll most likely see some references to IOPS and continue your research from there.

We've learned all these things one way or another. My experience started around 2007ish when I was renting out cheap servers from some hosting providers. Others might be dipping their feet into readily available cloud-infrastructure, and learning it from that end. Both works.

Anecdotal - but I once worked for a company where the product line I built for them after acquisition was delayed by 5 months because that's how long it took to get the hardware ordered and installed in the datacenter. Getting it up on AWS would have been a day's work, maybe two.

Yes, it is death by 1000 cuts. Speccing, negotiating with hardware vendors, data center selection and negotiating, DC engineer/remote hands, managing security cage access, designing your network, network gear, IP address ranges, BGP, secure remote console access, cables, shipping, negotiating with bandwidth providers (multiple, for redundancy), redundant hardware, redundant power sources, UPS. And then you get to plug your server in. Now duplicate other stuff your cloud might provide, like offsite backups, recovery procedures, HA storage, geographic redundancy. And do it again when you outgrow your initial DC. Or build your own DC (power, climate, fire protection, security, fiber, flooring, racks).

Much of this is still required in cloud. Also, I think you're missing the middle ground where 99.99% of companies could happily exist indefinitely: colo. It makes little to no financial or practical sense for most to run their own data centers.

Oh, absolutely, with your own hardware you need planning. Time to deployment is definitely a thing.

Really, the one major thing that bites with cloud providers is their 99.9% margin on egress. The markup is insane.

Writing piles of IaC code like Terraform and CloudFormation is also a PITA and a distraction from focusing on your core product.

PaaS is probably the way to go for small apps.

A small app (or a larger one, for that matter) can quite easily run on infra that's instantiated from canned IaC, like TF AWS Modules [0]. If you can read docs, you should be able to quite trivially get some basic infra up in a day, even with zero prior experience managing it.

[0]: https://github.com/terraform-aws-modules

Yes, I've used several of these modules myself. They save tons of time! Unfortunately, for legacy projects, I inherited a bunch of code from individuals that built everything "by hand" then copy-pasted everything. No re-usability.

But that effort has a huge payoff in that it can be used for disaster recovery in a new region and to spin up testing environments.

I'm with you there, with stuff like fly.io, there's really no reason to worry about infrastructure.

AWS, on the other hand, seems about as time consuming and hard as using root servers. You're at a higher level of abstraction, but the complexity is about the same I'd say. At least that's my experience.

I agree with this position and actively avoid AWS complexity.

> every minute messing with and monitoring servers

You're not monitoring your deployments because "cloud"?

Well, cloud providers often give more than just VMs in a data center somewhere. You may not be able to find good equivalents if you aren’t using the cloud. Some third-party products are also only available on clouds. How much of a difference those things make will depend on what you’re trying to do.

I think there are accounting reasons for companies to prefer paying opex to run things on the cloud instead of more capex-intensive self-hosting, but I don’t understand the dynamics well.

It’s certainly the case that clouds tend to be more expensive than self-hosting, even when taking account of the discounts that moderately sized customers can get, and some of the promises around elastic scaling don’t really apply when you are bigger.

To some of your other points: the main customers of companies like AWS are businesses. Businesses generally don’t care about the centralisation of the internet. Businesses are capable of reading the contracts they are signing and not signing them if privacy (or, typically more relevant to businesses, their IP) cannot be sufficiently protected. It’s not really clear to me that using a cloud is going to be less secure than doing things on-prem.

[deleted]

> All the pro-cloud talking points are just that - talking points that don't persuade anyone with any real technical understanding ...

And moreover, most of the actually interesting things, like VM templates, stateless containers, orchestration, etc., are very easy to run yourself and get you 99.9% of the benefits of the cloud.

Just about any and every service is available as a container file already written for you. And if it doesn't exist, it's not hard to plumb up.

A friend of mine runs more than 700 containers (yup, seven hundred), split between his own rack at home (half of them) and dedicated servers (he runs stuff like FlightRadar, AI models, etc.). He'll soon get his own IP address space. It's a complete "chaos monkey"-ready infra where you can cut any cable and the thing keeps working: everything is duplicated, can be spun up on demand, etc. Someone could steal his entire rack and all his dedicated servers, and he'd still be back operational in no time.

If an individual can do that, a company, no matter its size, can do it too. And arguably 99.9% of all the companies out there don't need an infra as powerful as the one most homelab enthusiasts have.

And another thing: there are even two in-betweens between "cloud" and "our own hardware located at our company". The first is colocating your own hardware in a datacenter. The second is renting dedicated servers from a datacenter.

They're often ready to accept cloud-init directly.
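To make that concrete, here's a minimal cloud-init user-data sketch of the kind many dedicated-server and colo providers accept at provision time. The user name, SSH key, and package list are placeholders of mine, not anything a specific provider requires:

```yaml
#cloud-config
# Illustrative first-boot config: create an admin user, install a couple
# of packages, and enable a service. Adapt the names and key to taste.
users:
  - name: ops
    groups: sudo
    shell: /bin/bash
    sudo: ['ALL=(ALL) NOPASSWD:ALL']
    ssh_authorized_keys:
      - ssh-ed25519 AAAAC3...placeholder ops@workstation
package_update: true
packages:
  - podman
  - wireguard
runcmd:
  - [systemctl, enable, --now, podman.socket]
```

The same file works whether the box is a rented dedicated server or your own colocated machine, which is a big part of why these in-betweens are so low-friction.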

And it's not hard. I'd say learning to configure hypervisors on bare metal, then spin up VMs from templates, then run containers inside the VMs is actually much easier than learning all the idiosyncrasies of the different cloud vendors' APIs and whatnot.

Funnily enough, when the pendulum swung way too far towards "cloud all the things", those saying we'd eventually read stories about repatriation were being made fun of.

> If an individual can do that, a company, no matter its size, can do it too.

Fully agreed. I don't have physical HA – if someone stole my rack, I would be SOL – but I can easily ride out a power outage for as long as I want to be hauling cans of gasoline to my house. The rack's UPS can keep it up at full load for at least 30 minutes, and I can get my generator running and hooked up in under 10. I've done it multiple times. I can lose a single server without issue. My only SPOF is internet, and that's only by choice, since I can get both AT&T and Spectrum here, and my router supports dual-WAN with auto-failover.

> And arguably 99.9% of all the companies out there don't have the need for an infra as powerful as the one most homelab enthusiast have.

THIS. So many people have no idea how tremendously fast computers are, and how much of an impact latency has on speed. I've benchmarked my 12-year-old Dells against the newest and shiniest RDS and Aurora instances on both MySQL and Postgres, and the only ones that kept up were the ones with local NVMe disks. Mine don't even technically have _local_ disks; they're NVMe via Ceph over Infiniband.

Does that scale? Of course not; as soon as you want geo-redundant, consistent writes, you _will_ have additional latency. But most smaller and medium companies don't _need_ that.

> All the pro-cloud talking points are just that - talking points that don't persuade anyone with any real technical understanding,(...)

This is where you lose all credibility.

I'm going to focus on a single aspect: performance. If you're serving a global user base and your business, like practically all online businesses, is greatly impacted by performance problems, the only solution to a physics problem is to deploy your application closer to your users.

With any cloud provider that's done with a few clicks and an invoice of a few hundred bucks a month. If you're running your own hardware... what solution do you have to show for it? Do you hope to create a corporate structure to rent a place to host your hardware, manned by a dedicated team? What options do you have?

Is everyone running online FPS gaming servers now? If you want your page to load faster, tell your shitty frontend engineers to use less of the latest frameworks. You are not limited by physics, 99% aren't.

I ping HN, it's 150ms away, it still renders in the same time that the Google frontpage does and that one has a 130ms advantage.

Erm, 99%'s clearly wrong and I think you know it, even if you are falling into the typical trap of "only Americans matter"...

As someone in New Zealand, latency does really matter sometimes, and is painfully obvious at times.

HN's ping for me is around 330 ms.

Anyway, ping doesn't really capture the latency of the full DNS lookup, TCP connection establishment, and TLS handshake: full responses for HN are around 900 ms for me, to last byte.

> latency does really matter sometimes

Yes, sometimes.

You know what matters way more?

If you throw 12 MBytes at the client over multiple connections on multiple domains to display 1 KByte of information. E.g.: 'new' Reddit.

The complexity of scaling out an application to be closer to the users has never been about getting the hardware closer. It's always about how do you get the data there and dealing with the CAP theorem, which requires hard tradeoffs to be decided on when designing the application and can't be just tacked on - there is no magic button to do this, in the AWS console or otherwise.

Getting the hardware closer to the users has always been trivial - call up any of the many hosting providers out there and get a dedicated server, or a colo and ship them some hardware (directly from the vendor if needed).

> This is where you lose all credibility.

People who write that, well...

If you're greatly impacted by performance problems, how does that become a physics problem whose only solution is being closer to your users?

I think you're mixing up your sales points. One, how do you scale hardware? Simple: you buy some more, and/or you plan for more from the beginning.

How do you deal with network latency for users on the other side of the planet? Either you plan for and design for long tail networking, and/or you colocate in multiple places, and/or you host in multiple places. Being aware of cloud costs, problems and limitations doesn't mean you can't or shouldn't use cloud at all - it just means to do it where it makes sense.

You're making my point for me - you've got emotional generalizations ("you lose all credibility"), you're using examples that people use often but that don't even go together, plus you seem to forget that hardly anyone advocates for all one or all the other, without some kind of sensible mix. Thank you for making a good example of exactly what I'm talking about.

If you have a global user base, depending on your workload, a simple CDN in front of your hardware can often go a long way with minimal cost and complexity.

> If you have a global user base, depending on your workload, a simple CDN in front of your hardware can often go a long way with minimal cost and complexity.

Let's squint hard enough to pretend a CDN does not qualify as "the cloud". That alone requires a lot of goodwill.

A CDN distributes read-only content. Any usecase that requires interacting with a service is automatically excluded.

So, no.

> Any usecase that requires interacting with a service is automatically excluded

This isn't correct. Many applications consist of a mix of static and dynamic content. Even dynamic content is often cacheable for a time. All of this can be served by a CDN (using TTLs) which is a much simpler and more cost effective solution than multi-region cloud infra, with the same performance benefits.
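As a sketch of the TTL idea (the function names here are mine, not any particular CDN's API): a shared cache decides whether it may serve a stored copy purely from the response's shared-cache TTL (`s-maxage` in `Cache-Control`) and the copy's age.

```python
from typing import Optional


def parse_s_maxage(cache_control: str) -> Optional[int]:
    """Extract s-maxage (seconds a shared cache may serve a response)
    from a Cache-Control header, e.g. 'public, s-maxage=30'."""
    for directive in cache_control.split(","):
        directive = directive.strip()
        if directive.startswith("s-maxage="):
            return int(directive.split("=", 1)[1])
    return None


def is_fresh(stored_at: float, cache_control: str, now: float) -> bool:
    """A CDN edge may serve the stored copy without contacting the
    origin while it is younger than its s-maxage TTL."""
    ttl = parse_s_maxage(cache_control)
    if ttl is None:
        return False  # no shared-cache TTL: go back to the origin
    return (now - stored_at) < ttl


# Dynamic content cached for 30 seconds at the edge:
header = "public, s-maxage=30"
print(is_fresh(1000.0, header, now=1020.0))  # True: copy is 20s old
print(is_fresh(1000.0, header, now=1031.0))  # False: past TTL, refetch
```

Even a short TTL like this collapses most origin traffic for read-heavy dynamic pages, which is the "same performance benefits, much less complexity" point above.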

I hear this debate repeated often, and I think there's another important factor. It took me some time to figure out how to explain it, and the best I came up with was this: It is extremely difficult to bootstrap from zero to baseline competence, in general, and especially in an existing organization.

In particular, there is a limit to paying for competence, and paying more money doesn't automatically get you more competence, which is especially perilous if your organization lacks the competence to judge competence. In the limit case, this gets you the Big N consultancies like PWC or EY. It's entirely reasonable to hire PWC or EY to run your accounting or compliance. Hiring PWC or EY to run your software development lifecycle is almost guaranteed doom, and there is no shortage of stories on this site to support that.

In comparison, if you're one of these organizations, who don't yet have baseline competence in technology, then what the public cloud is selling is nothing short of magical: You pay money, and, in return, you receive a baseline set of tools, which all do more or less what they say they will do. If no amount of money would let you bootstrap this competence internally, you'd be much more willing to pay a premium for it.

As an anecdote, my much younger self worked in a mid-sized tech team at a large household brand in a legacy industry. We were building out a web product that, for product reasons, had surprisingly high uptime and scalability requirements relative to legacy-industry standards. We leaned heavily on public cloud and CDNs. We used a lot of S3 and SQS, which allowed us to build systems with strong reliability characteristics, despite none of us having that background at the time.

I have about 30 years as a linux eng, starting with openbsd and have spent a LOT of time with hardware building webhosts and CDNs until about 2020 where my last few roles have been 100% aws/gcloud/heroku.

I love building the cool edge network stuff with expensive bleeding-edge hardware, smartnics, nvmeOF, etc., but it's infinitely more complicated and stressful than terraforming an AWS infra. Every cluster I set up, I had to interact with multiple teams: networking, security, storage, sometimes maintenance/electrical, etc. You've got some random tech you have to rely on across the country in one of your POPs with a blown server. Every single hardware infra person has had a NOC tech kick/unplug a server at least once if they've been in long enough.

And then when you get the hardware, sometimes you have different people doing different parts of setup: NOC does the boot, maybe bootstraps the hardware with something that works over ssh before an agent is installed (ansible, etc), then your linux eng invokes their magic with a ton of bash or perl, then your k8s person sets up the k8s clusters, usually with something like terraform/puppet/chef/salt, probably calling helm charts. Then your monitoring person gets it into OTEL/grafana, etc. This all organically becomes more automated as time goes on, but I've seen it many times from a brand-new infra with no automation at all.

Now you're automating 90% of this via scripts and IAC, etc, but you're still doing a lot of tedious work.

You also have a much more difficult time hiring good engineers. The market's gone so heavily AWS (I'm no help) that it's rare I come across an ops resume that's ever touched hardware, especially not at the CDN distributed-systems level.

So... AWS is the chill infra that stays online and that you can basically rely on 99.99-something% of the time. Get some terraform blueprints going and your own developers can self-serve. No need to get hardware or ops involved.

And none of this is even getting into supporting the clusters. Failing clusters. Dealing with maintenance, zero downtime kernel upgrades, rollbacks, yaddayadda.

This 1000%. There are so many cool networking/virtualization/hardware things I love dealing with. But the stress of doing ceph upgrades isn't the right trade off usually.

Most companies severely understaff ops, infra, and security. Your talking points might be good but, in practice, won’t apply in many cases because of the intractability of that management mindset. Even when they should know better.

I’ve worked at tech companies with hundreds of developers and single digit ops staff. Those people will struggle to build and maintain mature infra. By going cloud, you get access to mature infra just by including it in build scripts. Devops is an effective way to move infra back to project teams and cut out infra orgs (this isn’t great but I see it happen everywhere). Companies will pay cloud bills but not staffing salaries.

It's the exact same reason why most companies don't just run their own power stations, and instead buy it from a power company.

Computation has become a utility these days - this includes the fat ISP lines and connectivity etc, not just the CPU and harddrives. These things have economies of scale that smaller companies cannot truly reach, and will pay a huge fixed cost if they want state of the art management, monitoring and redundancy. So unless you are a massive consumer, just like power stations, you really don't need nor want to build your own.

Using a commercial cloud provider only cements understaffing in, in too many cases.

There is a whole ecosystem that pushes cloud onto ignorant/fresh graduates/developers. Just take a look at the sponsors of all the most popular frameworks. When your system is super complex and depends on the cloud, they make more money. Just look at the PHP ecosystem: Laravel needs four times the servers to serve something a pure PHP system could handle. Most projects don't need the cloud. Only around 10% of projects actually need what the cloud provides. But they were able to brainwash a whole generation of developers/managers into thinking that they do. And so it goes.

Having worked with Laravel, this is absolutely bull.

>What's particularly fascinating to me, though, is how some people are so pro-cloud that they'd argue with a writeup like this with silly cloud talking points. They don't seem to care much about data or facts, just that they love cloud and want everyone else to be in cloud, too.

The irony is absolutely dripping off this comment, wow.

Commenter makes an emotionally charged comment with no data or facts and dismisses anyone who disagrees with them as offering "silly talking points" for not caring about data and facts.

Your comment is entirely talking about itself.

My take on this whole cloud fatigue is that system maintenance got overly complex over the last couple of decades. So much so that management now thinks it's too expensive, in terms of hiring people who can do it, compared to the higher managed-hosting costs.

DevOps and kubernetes come to mind. A lot of people using kubernetes don't know what they're getting into, and k0s or another single machine solution would have been enough for 99% of SMEs.

In terms of cyber security (my field), everything got so ridiculously complex that even the folks who use 3 different dashboards in parallel will guess at whether or not they're affected by a bug/RCE/security flaw/weakness, because all of the data sources (even the expensively paid-for ones) are human-edited text databases. They're so buggy that they even have Chinese ideograms instead of a dot character in the version fields, without anyone ever fixing it upstream in the NVD/CVE process.

I started to build my EDR agent specifically for POSIX systems because I hope that at some point it can help companies ditch the cloud and allow them to self-host again - which in turn would indirectly prevent 13-year-old kids like those from LAPSUS from pwning major infrastructure via simple tech-support hotline calls.

When I think of it in terms of hosting, the vertical scalability of EPYC machines is so high that most of the time, if you're exhausting their resources, you are either doing something completely wrong and should refactor your code, or you are a video streaming service.

There was a time when cloud was significantly cheaper than owning.

I'd expect that there are people who moved to the cloud then, and over time started using services offered by their cloud provider (e.g., load balancers, secret management, databases, storage, backup) instead of running those services themselves on virtual machines, and now even if it would be cheaper to run everything on owned servers they find it would be too much effort to add all those services back to their own servers.

The cloud wasn’t about cheap, it was about fast. If you’re VC funded, time is everything, and developer velocity above all else to hyperscale and exit. That time has passed (ZIRP), and the public cloud margin just doesn’t make sense when you can own and operate (their margin is your opportunity) on prem with similar cloud primitives around storage and compute.

Elasticity is a component, but has always been from a batch job bin packing scheduling perspective, not much new there. Before k8s and Nomad, there was Globus.org.

(Infra/DevOps in a previous life at a unicorn, large worker cluster for a physics experiment prior, etc; what is old is a new again, you’re just riding hype cycle waves from junior to retirement [mainframe->COTS on prem->cloud->on prem cloud, and so on])

That was never true except in the case that the required hardware resources were significantly smaller than a typical physical machine.

1. People are credulous

2. People therefore repeat talking points which seem in their interest

3. With enough repetition these become their beliefs

4. People will defend their beliefs as theirs against attack

5. Goto 1

The one convincing argument I've seen from technical people, which could be made in reply to your comment, is that by now you don't find enough experienced engineers to reliably set up some really big systems. Because so much went to the cloud, a lot of the knowledge is buried there.

That came from technical people who I didn't perceive as being dogmatically pro-cloud.

I think part of it was a way for dev teams to get an infra team that was not empowered to say no. Plus organizational theory, empire building, etc.

Yep. I had someone tell me last week that they didn't want a more rigid schema because other teams rely on it, and anything adding "friction" to using it would be poorly received.

As an industry, we are largely trading correctness and performance for convenience, and this is not seen as a negative by most. What kills me is that at every cloud-native place I've worked, the infra teams were responsible for maintaining and fixing the infra that product teams demanded, but were not empowered to push back on unreasonable requests or usage patterns. It's usually not until either the limits of vertical scaling are reached, or a SEV0 occurs with these decisions as the root cause, that leadership even begins to consider changes.

It seems that the preference is less about understanding or misunderstanding the technical requirements but more that it moves a capital expenditure with some recurring operational expenditure entirely into the opex column.

> It makes me wonder: how do people get so sold on a thing that they'll go online and fight about it, even when they lack facts or often even basic understanding?

I feel like this can be applied to anything.

I had a manager take one SAFe for Leaders class then came back wanting to implement it. They had no previous AGILE classes or experience. And the Enterprise Agile Office was saying DON'T USE SAFe!!

But they had one class and that was the only way they would agree to structure their group.

Cloud solves one problem quite well: Geographic redundancy. It's extremely costly with on-prem.

Only if you’re literally running your own datacenters, which is in no way required for the majority of companies. Colo giants like Equinix already have the infrastructure in place, with a proven track record.

If you enable Multi-AZ for RDS, your bill doubles until you cancel. If you set up two servers in two DCs, your initial bill doubles from the CapEx, and then a very small percentage of your OpEx goes up every month for the hosting. You very, very quickly make this back compared to cloud.
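A back-of-the-envelope version of that comparison, with made-up numbers (not quotes from any provider):

```python
# Illustrative break-even arithmetic: managed Multi-AZ database vs.
# buying two servers and paying colo hosting. All figures are invented
# for the example.

cloud_monthly = 2_000.0          # managed DB bill for one AZ
cloud_multi_az = cloud_monthly * 2  # Multi-AZ doubles it

server_capex = 8_000.0           # one server, paid once
colo_monthly = 300.0             # hosting opex per server
own_capex = server_capex * 2     # redundancy doubles the one-off cost
own_monthly = colo_monthly * 2   # ...and slightly raises the opex


def months_to_break_even() -> float:
    """Months until cumulative cloud spend exceeds capex plus
    cumulative colo spend."""
    saving_per_month = cloud_multi_az - own_monthly
    return own_capex / saving_per_month


print(f"break-even after {months_to_break_even():.1f} months")  # 4.7
```

With these (invented) inputs the hardware pays for itself in under half a year; the point of the exercise is that the break-even horizon is short whenever the recurring cloud premium is a large multiple of your hosting opex.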

But reliable connectivity between regions/datacenters remains a challenge, right? Compute is only one part of the equation.

Disclaimer: I work on a cloud networking product.

It depends on how deep you want to go. Equinix for one (I'm sure others as well, but I'm most familiar with them) offers managed cross-DC fiber. You will probably need to manage the networking, to be fair, and I will readily admit that's not trivial.

[deleted]

I use Wireguard, pretty simple, where's the challenge?
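For what it's worth, a site-to-site WireGuard link really is just a short config file per end. A sketch for one side, with placeholder keys, addresses, and hostnames:

```ini
[Interface]
# This datacenter's tunnel address and key (placeholders)
Address = 10.90.0.1/24
PrivateKey = <this-side-private-key>
ListenPort = 51820

[Peer]
# The other datacenter
PublicKey = <other-side-public-key>
Endpoint = other-dc.example.net:51820
# Route the remote side's tunnel IP and internal subnet through the tunnel
AllowedIPs = 10.90.0.2/32, 172.16.20.0/24
PersistentKeepalive = 25
```

Bring it up with `wg-quick up wg0` on each end and you have an encrypted layer-3 link between sites; the harder questions (underlying path quality, bandwidth) are the ones raised in the replies below.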

I am referring to the layer 3 connectivity that Wireguard is running on top of. Depending on your use case and reliability and bandwidth requirements, routing everything over the “public” internet won’t cut it.

Not to mention setting up and maintaining your physical network as the number of physical hosts you’re running scales.

Except almost nobody, outside of very large players, does cross-region redundancy. us-east-1 is like a SPOF for the entire Internet.

Cloud noob here. But if I have a central database, what can I distribute across geographic regions? Static assets? Maybe a cache?

Yep. Cross-region RDBMS is a hard problem, even when you're using a managed service – you practically always have to deal with eventual consistency, or increased latency for writes.

Does it? I've seen outages around "Sorry, us-west_carolina-3 is down". AWS is particularly good at keeping you aware of their datacenters.

It can be useful. I run a latency-sensitive service with global users. A cloud lets me run it in 35 locations while dealing with only one company. Most of those locations only have enough traffic to justify a single, smallish instance.

In the locations where there's more traffic, and we need more servers, there are more cost effective providers, but there's value in consistency.

Elasticity is nice too, we doubled our instance count for the holidays, and will return to normal in January. And our deployment style starts a whole new cluster, moves traffic, then shuts down the old cluster. If we were on owned hardware, adding extra capacity for the holidays would be trickier, and we'd have to have a more sensible deployment method. And the minimum service deployment size would probably not be a little quad processor box with 2GB ram.

Using cloud for the lower traffic locations and a cost effective service for the high traffic locations would probably save a bunch of money, but add a lot of deployment pain. And a) it's not my decision and b) the cost difference doesn't seem to be quite enough to justify the pain at our traffic levels. But if someone wants to make a much lower margin, much simpler service with lots of locations and good connectivity, be sure to post about it. But, I think the big clouds have an advantage in geographic expansion, because their other businesses can provide capital and justification to build out, and high margins at other locations help cross subsidize new locations when they start.

I agree it can be useful (latency, availability, using off-peak resources), but running globally should be a default and people should opt-in into fine-grained control and responsibility.

From outside it seems that either AWS picked the wrong default to present their customers, or that it's unreasonably expensive and it drives everyone into the in-depth handling to try to keep cloud costs down.

if you see that, you are doing it wrong :)

AWS has had multiple outages which were caused by a single AZ failing.

My company used to do everything on-prem. Until a literal earthquake and tsunami took down a bunch of systems.

After that, yeah we’ll let AWS do the hard work of enabling redundancy for us.

The problem with your claims here is they can only be right if the entire industry is experiencing mass psychosis. I reject a theory that requires that, because my ego just isn't that large.

I once worked for several years at a publicly traded firm well-known for their return-to-on-prem stance, and honestly it was a complete disaster. The first-party hardware designs didn't work right because they didn't have the hardware-design staffing levels to have de-risked the possibility that AMD would fumble the performance of Zen 1, leaving them with a generation of useless hardware they nonetheless paid for. The OEM hardware didn't work right because they didn't have the chops to qualify it either, leaving them scratching their heads for months over a cohort of servers they eventually discovered were contaminated with metal chips. And, most crucially, for all the years I worked there, the only thing they wanted to accomplish was failover from West Coast to East Coast, which never worked, not even once. When I left that company they were negotiating with the data center owner, who wanted to triple the rent.

These experiences tell me that cloud skeptics are sometimes missing a few terms in their equations.

"Vendor problems" is a red herring, IMO; you can have those in the cloud, too.

It's been my experience that those who can build good, reliable, high-quality systems, can do so either in the cloud or on-prem, generally with equal ability. It's just another platform to such people, and they will use it appropriately and as needed.

Those who can only make it work in the cloud are either building very simple systems (which is one place where the cloud can be appropriate), or are building a house of cards that will eventually collapse (or just cost them obscene amounts of money to keep on life support).

Engineering is engineering. Not everyone in the business does it, unfortunately.

Like everything, the cloud has its place -- but don't underestimate the number of decisions that get taken out of the hands of technical people by the business people who went golfing with their buddy yesterday. He just switched to Azure, and it made his accountants really happy!

The whole CapEx vs. OpEx issue drives me batty; it's the number one cause of cloud migrations in my career. For someone who feels like spent money should count as spent money regardless of the bucket it comes out of, this twists my brain in knots.

I'm clearly not a finance guy...

> or are building a house of cards that will eventually collapse (or just cost them obscene amounts of money to keep on life support)

Ding ding ding. It's this.

> The whole CapEx vs. OpEx issue drives me batty

Seconded. I can't help but feel like it's not just a "I don't understand money" thing, but more of a "the way Wall Street assigns value is fundamentally broken." Spending $100K now, once, vs. spending $25K/month indefinitely does not take a genius to figure out.

> Spending $100K now, once, vs. spending $25K/month indefinitely does not take a genius to figure out.

If you multiply your monthly payment by 1/i, where i is the interest rate your business can get, you'll get how much up-front money it's worth.

... that is, until next month, when the interest rate will change, a fact that always catches everyone by surprise, and you'll need to rush to fix your cash-flow.

So, yeah, I don't understand that either. Somehow, despite neither of us understanding how it can possibly work, it seems to fail to work empirically too, adding a huge amount of instability to companies.

That is, unless you decide to look at it from the perspective of executive bonuses, that are capped to 0, but can grow indefinitely. So instability is the point.
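The 1/i rule mentioned above is just the present value of a perpetuity; a quick sketch, where the 0.5% monthly discount rate is an assumed figure, not anything from the thread:

```python
def perpetuity_pv(monthly_payment: float, monthly_rate: float) -> float:
    """Present value of paying `monthly_payment` forever, discounted
    at `monthly_rate` per month: PV = payment / rate (the 1/i rule)."""
    return monthly_payment / monthly_rate


# $25K/month at an assumed 0.5% monthly (~6% annual) discount rate is
# worth about $5M up front -- which is why the comparison to a one-off
# $100K capex purchase is less lopsided to finance than it looks,
# and why it swings around whenever rates move.
print(f"${perpetuity_pv(25_000, 0.005):,.0f}")  # $5,000,000
```

The rate sensitivity is the instability being complained about: the same $25K/month obligation is "worth" $2.5M at 1% monthly and $5M at 0.5%.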

you forgot COGS

it's all about painting the right picture for your investors, so you make up shit and classify it as COGS or opex depending on what is most beneficial for you in the moment

> The problem with your claims here is they can only be right if the entire industry is experiencing mass psychosis.

Yes. Mass psychosis explains an incredible number of different and apparently unrelated problems with the industry.

There's however a middle-ground between run your own colocated hardware and cloud. It's called "dedicated" servers and many hosting providers (from budget bottom-of-the-barrel to "contact us" pricing) offer it.

Those providers take on the risk and liability of sourcing, managing and maintaining the hardware for a flat monthly fee. If they make a bad bet purchasing hardware, you won't be on the hook for it.

This seems like a point many pro-cloud people (intentionally?) overlook.

> The problem with your claims here is they can only be right if the entire industry is experiencing mass psychosis.

What's the market share of Windows again? ;)

You're proving their point though. Considering that there are tons of reasons to use windows, some people just don't see them and think that everyone else is crazy :^) (I know you're joking but some people actually unironically have the same sentiment)

> a desire to not centralize the Internet

> If I didn't already self-host email

this really says all that needs to be said about your perspective. you have an engineer and OSS advocate's mindset. which is fine, but most business leaders (including technical leaders like CTOs) have a business mindset, and their goal is to build a business that makes money, not avoid contributing to the centralization of the internet

> On the other hand, a business of just about any size that has any reasonable amount of hosting is better off with their own systems when it comes purely to cost

From a cost PoV, sure, but when you're taking money out of capex it represents a big hit to the cash flow, while taking out twice that amount from opex has a lower impact on the company finances.

Cloud is more than instances. If all you need is a bunch of boxes, then cloud is a terrible fit.

I use AWS cloud a lot, and almost never use any VMs or instances. Most instances I use are along the lines of a simple anemic box for a bastion host or some such.

I use higher level abstractions (services) to simplify solutions and outsource maintenance of these services to AWS.

They spent time and career points learning cloud things and dammit it's going to matter!

You can't even blame them too much, the amount of cash poured into cloud marketing is astonishing.

The thing that frustrates me is it’s possible to know how to do both. I have worked with multiple people who are quite proficient in both areas.

Cloud has definite advantages in some circumstances, but so does self-hosting; moreover, understanding the latter makes the former much, much easier to reason about. It’s silly to limit your career options.

Being good at both is twice the work, because even if some concepts translate well, IME people won't hire someone based on that. "Oh you have experience with deploying RabbitMQ but not AWS SQS? Sorry, we're looking for someone more qualified."

That's a great filter for places I don't want to work at, then.

I want to see an article like this, but written from a Fortune 500 CTO perspective

It seems like they all abandoned their VMware farms or physical server farms for Azure (they love Microsoft).

Are they actually saving money? Are things faster? How's performance? What was the re-training/hiring like?

In one case I know we got rid of our old database greybeards and replaced them with "DevOps" people that knew nothing about performance etc

And the developers (and many of the admins) we had knew nothing about hardware or anything so keeping the physical hardware around probably wouldn't have made sense anyways

Complicating this analysis is that computers have still been making exponential improvements in capability as clouds became popular (e.g. disks are 1000-10000x faster than they were 15 years ago), so you'd naturally expect things to become easier to manage over time as you need fewer machines, assuming of course that your developers focus on e.g. learning how to use a database well instead of how to scale to use massive clusters.

That is, even if things became cheaper/faster, they might have been even better without cloud infrastructure.

>we got rid of our old database greybeards and replaced them with "DevOps" people that knew nothing about performance etc

Seems a lot of those DevOps people just see Azure's recommendations for adding indexes and either allow auto-applying them or add them without actually reviewing and understanding which query loads require them and why. Some of this also lands on developers/product, who don't think critically about and communicate which queries are common, and which indexes would be beneficial up front. (Yes, follow-up monitoring of actual index usage and possible missing indexes is still needed.) Too many times I've seen dozens of indexes on cloud tables where one could cover all of them. There might still be worthwhile reasons to keep some narrower/smaller indexes, but DBA work and critical query analysis seem to be forgotten, neglected skills. No one owns monitoring and analysing DB queries, and it only comes up after a fire has already broken out.

The real cost wins of self-hosting come from friction: anything needing new hardware becomes an ordeal, and engineers won't reach for high-cost, value-added services. I agree that there's often too little restraint in cloud architectures, but if a business truly believes in a project, it shouldn't be held up for six months waiting for server budget, with engineers spending their time doing ops work to get three nines of DB reliability.

There is a size where self-hosting makes sense, but it's much larger than you think.

Also, by the way, I found it interesting that you framed your side of this disagreement as the technically correct one, but then included this:

> a desire to not centralize the Internet

This is an ideological stance! I happen to share this desire. But you should be aware of your own non-technical - "emotional" - biases when dismissing the arguments of others on the grounds that they are "emotional" and "fanatical".

I never said that my own reasons weren't personal or emotional. I was just pointing out that my reasons are easy to articulate.

I do think it's more than just emotional, though, but most people, even technical people, haven't taken the time to truly consider the problems that will likely come with centralization. That's a whole separate discussion, though.

...but your post reads like you do have an emotional reaction to this question and you're ready to believe someone who shares your views.

There's not nearly enough in here to make a judgment about things like security or privacy. They have the bare minimum encryption enabled. That's better than nothing. But how is key access handled? Can they recover your email if the entire cluster goes down? If so, then someone has access to the encryption keys. If not, then how do they meet reliability guarantees?

Three letter agencies and cyber spies like to own switches and firewalls with zero days. What hardware are they using, and how do they mitigate against backdoors? If you really cared about this you would have to roll your own networking hardware down to the chips. Some companies do this, but you need to have a whole lot of servers to make it economical.

It's really about trade-offs. I think the big trade-offs favoring staying off cloud are cost (in some applications), distrust of the cloud providers, and avoiding the US Government.

The last two are arguably judgment calls that have some inherent emotional content. The first is calculable in principle, but people may not be using the same metrics. For example if you don't care that much about security breaches or you don't have to provide top tier reliability, then you can save a ton of money. But if you do have to provide those guarantees, it would be hard to beat Cloud prices.

In the public sector, cloud solves the procurement problem. You just need to go through the yearlong process once to use a cloud service, instead of for each purchase > 1000€.

Capital expenditures are kryptonite to financial engineers. The cloud selling point was to trade those costs for operational expenses and profit in phase 3.

As someone who ran a startup with 100s of hosts: as soon as I start to count the salaries, hiring, desk space, etc. of the people needed to manage the hosts, AWS looks cheap again. Yeah, on hardware costs they are aggressively expensive. But TCO-wise, they're cheap for any decent-sized company.

Add in compliance, auditing, etc., all things that you can set up out of the box (PCI, HIPAA, lawsuit retention), and it gets even cheaper.

I'm curious about what "reasonable amount of hosting" means to you, because in my experience, as your internal network's complexity goes up, it's far better for you to move systems to a hyperscaler. The current estimate is >90% of Fortune 500 companies are cloud-based. What is it that you know that they don't?

> how do people get so sold on a thing that they'll go online and fight about it, even when they lack facts or often even basic understanding?

Are you new to the internet?

The bottom line > babysitting hardware. Businesses are transitioning to cloud because it's better for business.

Actually, there's been a reversal underway; for many companies, "better" is now often on-premises or hybrid.

> What's particularly fascinating to me, though, is how some people are so pro-cloud that they'd argue with a writeup like this with silly cloud talking points.

I’m sure I’ll be downvoted to hell for this, but I’m convinced that it’s largely their insecurities being projected.

Running your own hardware isn’t tremendously difficult, as anyone who’s done it can attest, but it does require a much deeper understanding of Linux (and of course, any services which previously would have been XaaS), and that’s a vanishing trait these days. So for someone who may well be quite skilled at K8s administration, serverless (lol) architectures, etc. it probably is seen as an affront to suggest that their skill set is lacking something fundamental.

> So for someone who may well be quite skilled at K8s administration ...

And running your own hardware is not incompatible with Kubernetes: on the contrary. You can fully well have your infra spin up VMs and then do container orchestration if that's your thing.

And part of your hardware monitoring and reporting tooling can work perfectly fine from containers.

Bare metal -> Hypervisor -> VM -> container orchestration -> a container running a "stateless" hardware monitoring service. And VMs themselves are "orchestrated" too. Everything can be automated.

Anyway, say a hard disk begins to show errors? Notifications get sent (email/SMS/Telegram/whatever) by another service in another container, and the dashboard shows it too (dashboards are cool).

Go to the machine once the spare disk has already been resilvered, move it to where the failed disk was, plug in a new disk that becomes the new spare.

Boom, done.

I'm not saying all self-hosted hardware should do container orchestration: there are valid use cases for bare metal too.

But something has to be said for controlling everything on your own infra: from the bare metal to the VMs to container orchestration, even potentially your own IP address space.

This is all within reach of an individual, both skill-wise and price-wise (including obtaining your own IP address space). People who drank the cloud kool-aid should ponder this and wonder how good their skills truly are if they cannot get this up and working.

Fully agree. And if you want to take it to the next level (and have a large budget), Oxide [0] seems to have neatly packaged this into a single coherent product. They don't quite have K8s fully running, last I checked, but there are of course other container orchestration systems.

> Go to the machine once the spare disk as already been resilvered

Hi, fellow ZFS enthusiast :-)

[0]: https://oxide.computer

> And running your own hardware is not incompatible with Kubernetes: on the contrary

Kubernetes actually makes so much more sense on bare-metal hardware.

On the cloud, I think the value prop is dubious - your cloud provider is already giving you VMs, why would you need to subdivide them further and add yet another layer of orchestration?

Not to mention that you're getting 2010s-era performance on those VMs, so subdividing them is terrible from a performance point of view too.

> Not to mention that you're getting 2010s-era performance on those VMs, so subdividing them is terrible from a performance point of view too.

I was trying in vain to explain to our infra team a couple of weeks ago why giving my team a dedicated node of a newer instance family with DDR5 RAM would be beneficial for an application which is heavily constrained by RAM speed. People seem to assume that compute is homogenous.

I would wager that the same kind of people that were arguing against your request for a specific hardware config are the same ones in this comment section railing against any sort of self-sufficiency by hosting it yourself on hardware. All they know is cloud, all they know how to do is "ScAlE Up thE InStanCE!" when shit hits the fan. It's difficult to argue against that and make real progress. I understand your frustration completely.

I agree; I run PROD, TEST and DEV kube clusters all in VMs, works great.

[deleted]

> If I didn't already self-host email, I'd consider using Fastmail.

Same sentiment on all of what you said.

[deleted]

> All the pro-cloud talking points are just that - talking points that don't persuade anyone with any real technical understanding, but serve to introduce doubt to non-technical people and to trick people who don't examine what they're told.

This feels like "no true scotsman" to me. I've been building software for close to two decades, but I guess I don't have "any real technical understanding" because I think there's a compelling case for using "cloud" services for many (honestly I would say most) businesses.

Nobody is "afraid to openly discuss how cloud isn't right for many things". This is extremely commonly discussed. We're discussing it right now! I truly cannot stand this modern innovation in discourse of yelling "nobody can talk about XYZ thing!" while noisily talking about XYZ thing on the lowest-friction publishing platforms ever devised by humanity. Nobody is afraid to talk about your thing! People just disagree with you about it! That's ok, differing opinions are normal!

Your comment focuses a lot on cost. But that's just not really what this is all about. Everyone knows that on a long enough timescale with a relatively stable business, the total cost of having your own infrastructure is usually lower than cloud hosting.

But cost is simply not the only thing businesses care about. Many businesses, especially new ones, care more about time to market and flexibility. Questions like "how many servers do we need? with what specs? and where should we put them?" are a giant distraction for a startup, or even for a new product inside a mature firm.

Cloud providers provide the service of "don't worry about all that, figure it out after you have customers and know what you actually need".

It is also true that this (purposefully) creates lock-in that is expensive either to leave in place or unwind later, and it definitely behooves every company to keep that in mind when making architecture decisions, but lots of products never make it to that point, and very few of those teams regret the time they didn't spend building up their own infrastructure in order to save money later.

> The whole push to the cloud has always fascinated me. I get it - most people aren't interested in babysitting their own hardware.

For businesses, it's a very typical lease-or-own decision. There's really nothing too special about cloud.

> On the other hand, a business of just about any size that has any reasonable amount of hosting is better off with their own systems when it comes purely to cost.

Nope. Not if you factor-in 24/7 support, geographic redundancy, and uptime guarantees. With EC2 you can break even at about $2-5m a year of cloud spending if you want your own hardware.

I did compliance for a fintech under heavy regulation.

If we used AWS, we could skip months of certification. If we used a custom data center, we would have to certify it ourselves (muuuuuch more expensive).

From this standpoint, cloud beats on-premise.

capex vs opex

To me, Cloud is all about the shift left of DevOps. It's not a cost play. I'm a Dev Lead / Manager and have worked in both types of environments over the last 10 years. The velocity difference in system provisioning between the two approaches is immeasurable.

In the hardware space, it took months to years to provision new machines or upgrade OSes. In the cloud, it's a new terraform script and a CI deploy away. Need more storage? It's just there, available all the time. Need to add a new firewall between machines or redo the network topology? Free. Need a warm standby in 4 different regions that costs almost nothing but can scale to full production capacity within a couple of minutes? Done. Those types of things are difficult to do with physical hardware.

And if you have an engineering culture where the operational work and the development work are at odds (think the old style of Dev / QA / Networking / Servers / Security all being separate teams), processes and handoffs eat your lunch and it becomes crippling to your ability to innovate. Cloud and DevOps are to me about reducing the differentiation between these roles so that a single engineer can do any part of the stack, which cuts out the communication overhead, the handoff time, and the processes significantly.

If you have predictable workloads, a competent engineering culture that fights against process culture, and are willing to spend the money to have good hardware and the people to man it 24x7x365 then I don’t think cloud makes sense at all. Seems like that’s what y’all have and you should keep up with it.

> In the hardware space, it took months to years to provision new machines or upgrade OSes.

If it takes this long to manage a machine, I strongly suspect it means that when initially designing the system engineers had failed to account for those for some reason. Was that true in your case?

Back in the late '00s until the mid '10s, I worked for an ISP startup as a SWE. We had a few core machines (database, RADIUS server, self-service website, etc) - an ugly mess TBH - initially provisioned and originally managed entirely by hand, as we didn't know any better back then. Naturally, maintaining those was a major PITA, so they sat on the same dated distro for years. That was before Ansible was a thing, and we hadn't really heard of Salt or Chef before we started to feel the pains and went searching for solutions. Virtualization (OpenVZ, then Docker) helped to soften a lot of issues, making it significantly easier to maintain the components, but the pains from our original sins were felt for a long time.

But we also had a fleet of other machines, where we understood our issues with the servers enough to design new nodes to be as stateless as possible, with automatic rollout scripts for whatever we were able to automate. Provisioning a new host took only a few hours, with most time spent unpacking, driving, accessing the server room, and physically connecting things. Upgrades were pretty easy too - reroute customers to another failover node, write a new system image to the old one, reboot, test, re-route traffic back, done.

So it's not like self-owned bare metal is harder to manage; the lesson I learned is that you just gotta think ahead of time about what the future will require. Same as the clouds, I guess: one has to follow best practices or they'll end up with crappy architectures that are painful to rework. Just a different set of practices, because of the different nature of the systems.
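The "as stateless as possible, with automatic rollout scripts" approach described above boils down to making every provisioning step idempotent: safe to re-run on a half-configured or already-configured node. A minimal Python sketch of that pattern (`ensure_file` is a hypothetical helper, not from any particular tool):

```python
import os
import tempfile

def ensure_file(path: str, content: str) -> bool:
    """Idempotently ensure `path` contains `content`; return True if changed.

    Each rollout step checks desired state before acting, so the whole
    script can be re-run as often as you like without side effects.
    """
    try:
        with open(path) as f:
            if f.read() == content:
                return False  # already in the desired state, do nothing
    except FileNotFoundError:
        pass
    with open(path, "w") as f:
        f.write(content)
    return True

# First run converges the node; the second run is a no-op.
cfg = os.path.join(tempfile.mkdtemp(), "node.conf")
print(ensure_file(cfg, "role=failover\n"))  # True  (file created)
print(ensure_file(cfg, "role=failover\n"))  # False (nothing to do)
```

Tools like Ansible and Salt are, at heart, large catalogs of resource types built around exactly this check-then-act loop.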

Exactly this. It is culture and organisation (structure) dependent. I'm in the throes of the same discussion with my leadership team, some of whom have built themselves an ops/qa/etc. empire and want to keep their moat.

Are you running a well understood and predictable (as in, little change, growth, or feature additions) system? Are your developers handing over to central platform/infra/ops teams? You'll probably save some cash by buying and owning the hardware you need for your use case(s). Elasticity is (probably) not part of your vocabulary, perhaps outside of "I wish we had it" anyway.

Have you got teams and/or products that are scaling rapidly or unpredictably? Have you still got a lot of learning and experimenting to do with how your stack will work? Do you need flexibility but can't wait for that flexibility? Then cloud is for you.

n.b. I don't think I've ever felt more validated by a post/comment than yours.

Our CI pipelines can spin up some seriously meaty hardware, run some very resource intensive tests, and destroy the infrastructure when finished.

Bonus points: they can do it with spot pricing to further lower the bill.

The cloud offers immense flexibility and empowers _developers_ to easily manage their own infrastructure without depending on other teams.

Speed of development is the primary reason $DayJob is moving into the cloud, while maintaining bare-metal for platforms which rarely change.

I think I understand your point, and this is not directed at you personally, but: I think "shift left" is another one of those phrases that's lost all meaning, like "synergy" or "agile" before it.

My first job in tech was building servers for companies when they needed more compute, physically building them from our warehouse of components, driving them to their site, and setting it up in their network.

You could get same day builds deployed on prem with the right support bundle!

Plugging https://BareMetalSavings.com

in case you want to ballpark-estimate your move off of the cloud

Bonus points: I'm a Fastmail customer, so it tangentially tracks

----

Quick note about the article: ZFS encryption can be flaky, be sure you know what you're doing before deploying for your infrastructure.

Relevant Reddit discussion: https://www.reddit.com/r/zfs/comments/1f59zp6/is_zfs_encrypt...

A spreadsheet of related issues (I can't remember who made it):

https://docs.google.com/spreadsheets/d/1OfRSXibZ2nIE9DGK6sww...

Yeah, we know about the ZFS encryption with send/receive bug, it's frustrating our attempts to get really nice HA support on our logging system... but so far it appears that just deleting the offending snapshot and creating a new one works, and we're funding some research into the issue as well.

This is the current script - it runs every minute for each pool synced between the two log servers: https://gist.github.com/brong/6a23fee1480f2d62b8a18ade5aea66...

Thanks for sharing!

[deleted]

My main issue with ZFS encryption is that it only supports one key.

LUKS2 has something like 9 key slots.

I run ZoL over LUKS2 and it works great.

Such an awesome article. I like how they didn't just go with the Cloud wave but kept sysadmin'ing, like ol' Unix graybeards. Two interesting things they wrote about their SSDs:

1) "At this rate, we’ll replace these [SSD] drives due to increased drive sizes, or entirely new physical drive formats (such E3.S which appears to finally be gaining traction) long before they get close to their rated write capacity."

and

2) "We’ve also anecdotally found SSDs just to be much more reliable compared to HDDs (..) easily less than one tenth the failure rate we used to have with HDDs."

To avoid sysadmin tasks, and keep costs down, you've got to go so deep in the cloud that it becomes just another arcane skill set. I run most of my stuff on virtual Linux servers, but some on AWS, and that's hard to learn, and doesn't transfer to GCP or Azure. Unless your needs are extreme, I think sysadmin'ing is the easier route in most cases.

For so many things the cloud isn't really easier or cheaper, and most cloud providers stopped advertising it as such. My assumption is that cloud adoption is mainly driven by 3 forces:

- for small companies: free credits

- for large companies: moving prices as far away as possible from the deploy button, allowing dev and IT to just deploy stuff without purchase orders

- self-perpetuating due to hype, CV-driven development, and ease of hiring

All of these are decent reasons, but none of them may apply to a company like fastmail

Also CYA. If you run your own servers and something goes wrong, it's your fault. If it's an outage at AWS, it's their fault.

Also a huge element of following the crowd, branding non-technical management are familiar with, and so on. I have also found some developers (front-end devs, or back-end devs who lack sysadmin skills) feel cloud is the safe choice. This is very common for small companies, as they may have limited sysadmin skills (people who know how to keep Windows desktops running are not likely to be the people you want deploying servers), and a web GUI looks a lot easier to learn.

> If it's an outage at AWS it's their fault.

Well, still your fault, but easy to judo the risk into clients saying supporting multi-cloud is expensive and not a priority.

Management in many places will not even know what multi-cloud is (or even multi-region).

As Cloudstrike showed, if you follow the crowd and tick the right boxes you will not be blamed.

nit: Crowdstrike

Unless the incident is now being referred to as “Cloudstrike”, in which case, eww

Yeah, he meant Crowdstrike. Cloudstrike is the name of a future security incident affecting multiple cloud providers. I can't disclose more details.

In small companies, cloud also provides the ability to work around technical debt and to reduce risk.

For example, I have seen several cases where a poorly designed system unexpectedly used too much memory and there was no time to fix it, so the company increased the memory on all instances with a few clicks. When you need to do this immediately to avoid a botched release that has already been called "successful" and announced as such to stakeholders, that is a capability that saves the day.

An example of de-risking is using a cloud filesystem like EFS to provide a pseudo-infinite volume. No risk of an outage due to an unexpectedly full disk.

Another example would be using a managed database system like RDS vs self-managing the same RDBMS: using the managed version saves on labor and reduces risk for things like upgrades. What would ordinarily be a significant effort for a small company becomes automatic, and RDS includes various sanity checks to help prevent you from making mistakes.

The reality of the industry is that many companies are just trying to hit the next milestone of their business by a deadline, and the cloud can help despite the downsides.

> For example, I have seen several cases where poorly designed systems that unexpectedly used too much memory

> using a managed database system like RDS vs self-managing the same RDBMS: using the managed version saves on labor

As a DBRE / SRE, I can confidently assert that belief in the latter is often directly responsible for the former. AWS is quite clear in their shared responsibility model [0] that you are still responsible for making sound decisions, tuning various configurations, etc. Having staff that knows how to do these things often prevents the poor decisions from being made in the first place.

[0]: https://aws.amazon.com/compliance/shared-responsibility-mode...

Not a DB admin, but I do install and manage DBs for small clients.

My experience is that AWS makes the easy things easy and the difficult things difficult, and the knowledge is not transferable.

With a CLI or non-cloud management tools I can create, admin and upgrade a database (or anything else) exactly the same way, locally, on a local VM, and on a cloud VM from any provider (including AWS). Doing it with a managed database means learning how the provider does it - which takes longer and I personally find it more difficult (and stressful).

What I cannot do as well as a real DB admin could is things like tuning. It's not really an issue for small clients (a few generic changes to scale settings to available resources is enough, and cheaper than paying someone to tune it). Come to think of it, I do not even know how to make those changes on AWS and just hope the defaults match the size of RDS you are paying for (and change when you scale up?).

having written the above I am now doubting whether I have done the right thing in the past.

There are other, often at least tangentially related, reasons, but more than I can do justice to in a comment.

People got a lot of things wrong about cloud, which I've been meaning to write about for a while. I'll get to it after the holidays. But probably none more than the idea that massive centralized computing (which was wrongly characterized as a utility like the electric grid) would have economics with which more local computing options could never compete.

A cloud is really easy to get started with.

Free tiers, startup credits, easily available managed databases, queues, object storage, lambdas, load-balancing, DNS, TLS, specialist stuff like OCR. It's easy to prototype something, run for free or for peanuts, start getting some revenue.

Then, as you grow, the costs become steeper, but migrating off of the cloud looks even more expensive, especially if you have accumulated a lot of data (egress costs you, especially from AWS). Congrats, you have become the desirable, typical cloud customer.

I'm very interested in approaches that avoid cloud, so please don't read this as me saying cloud is superior. I can think of some other advantages of cloud:

- easy to setup different permissions for users (authorisation considerations).

- able to transfer assets to another owner (e.g., if there's a sale of a business) without needing to move physical hardware.

- other outsiders (consultants, auditors, whatever) can come in and verify the security (or other) of your setup, because it's using a standard well known cloud platform.

Those are valid reasons, but not always as straightforward:

> easy to setup different permissions for users (authorisation considerations)

Centralized permission management is an advantage of the cloud. At the same time, it's easy to get wrong. Without the cloud you usually have more piecemeal solutions, depending on segmenting network access and using the permission systems of each service.

> able to transfer assets to another owner (e.g., if there's a sale of a business) without needing to move physical hardware

The obvious solution here is to not own your hardware but to rent dedicated servers. That removes some of the maintenance burden, and the servers can be moved between entities as you like. The cloud does give you more granularity, though.

> other outsiders (consultants, auditors, whatever) can come in and verify the security (or other) of your setup, because it's using a standard well known cloud platform

There is a huge cottage industry of software trying to scan for security issues in your cloud setups. On the one hand that's an advantage of a unified interface, on the other hand a lot of those issues wouldn't occur outside the cloud. In any case, verifying security isn't easy in or out of the cloud. But if you have an auditor that is used to cloud deployments it will be easier to satisfy them there, that's certainly true

[deleted]

I predict a slow but unstoppable comeback of the sysadmin job over the next 5-10 years.

It never disappeared in some places. In my region there's been zero interest in "the cloud" because of physical remoteness from all major GCP/AWS/Azure datacenters (resulting in high latency), for compliance reasons, and because it's easier and faster to solve problems by dealing with a local company than pleading with a global giant that gives zero shits about you because you're less than a rounding error in its books.

> it becomes just another arcane skill set

It's an arcane skill set with a GUI, which makes it look much easier to learn.

My beard isn't entirely grey yet!

The new NVMe drives we've only had for a few years, but so far there's only been a single failure across the whole fleet, and we keep spares in stock. It's been very reliable, not like back in the ancient past (hmm, 2006? 2007?), when we were losing 15kRPM velociraptors every other day. They had a firmware fault, and we eventually got an update which made them reliable, but it was a wild few months.

A few more than one, but it has been a lot less than when we were dealing with spinners. I think I requested about one or two replacements a year, a far cry from the one a week I was doing before.

If I didn't see it, it didn't happen...

OK, I stand corrected. We lose one or two NVMe drives per year :)

Can I ask which brands / models of SSD are you using?

We didn’t replace all servers at once; it was progressive, so due to availability the models we used changed over time.

Our first batch of all nvme machines had SSDPE2KX080T8, they became harder to source and we moved to SSDPF2KX076T1.

With Intel no longer in the ssd business I believe we have some Micron MTFDKCC7T6TGH and MTFDKCC30T7TGR. And as mentioned in the blog post we've recently purchased some Solidigm D5-P5336 which are 61TB monsters.

Here's a fun related story. Our supplier had so much trouble finding SSDPE2KX080T8 that when we had exhausted our spares, I had to sync everything off a machine, tear it down and pull its drives for spares, and rebuild it with the smaller SSDPF2KX076T1. Then we had lots of spares.

SSDs are also a bit of an Achilles heel for AWS -- they have their own Nitro firmware for wear levelling and key rotations, due to the hazards of multitenancy. It's possible for one EC2 tenant to use up all the write cycles and then pass the drive to another, and encryption with key rotation is required to keep data from leaking across tenant changes. It's also slower.

We had one outage where key rotation had been enabled on reboot, so data partitions were lost after what should have been a routine crash. Overall, for data warehousing, our failure rate on on-prem (DC-hosted) hardware was lower IME.

The power of Moore's law.

I don't see how point 2 could have come as a surprise to anyone.

The fact that Fastmail work like this, are transparent about what they're up to and how they're storing my email and the fact that they're making logical decisions and have been doing so for quite a long time is exactly the reason I practically trip over myself to pay them for my email. Big fan of Fastmail.

I recently officially became a Fastmail user when pobox.com transitioned to Fastmail, and was very impressed with customer service when I had a technical question.

They are also active in contributing to cyrus-imap

I have seen a common sentiment that self-hosting is almost always better than cloud. What these discussions don't mention is how to effectively run your business applications on this infrastructure.

Things like identity management (AAD/IAM), provisioning and running VMs, deployments. The network side of things like VNet, DNS, securely opening ports, etc. Monitoring setup across the stack. There is so much functionality required to safely expose an application externally that I can't even coherently list it all here. Are people just using SaaS for everything (which I think would defeat the purpose of on-prem infra), or can a competent sysadmin handle all this to give a cloud-like experience to end developers?

Can someone share their experience or share any write ups on this topic?

For more context, I briefly worked at a very large hedge fund which had a small DC's worth of VERY beefy machines but absolutely no platform on top of it. Hosting an application was done by copying the binaries onto a particular well-known machine, running npm commands, and restarting nginx. You'd log a ticket with a sysadmin to create a DNS entry pointing an internal name at this machine (no load balancer). Deployment was a shell script which rcp'd new binaries and restarted nginx. No monitoring or observability stack. There was a script which would log you into a random machine to run your workloads (be ready to get angry IMs from senior quants running their workloads on that random machine if your development build takes up enough resources to affect their work). I can go on and on, but I think you get the idea.

We're (very boring, I know) just putting it all in a git repository with a Makefile which deploys it, plus some basic orchestration to run 'make diff' across the cluster and see what's out of sync, and 'make install' across hosts to deploy it into place.

It's clunky, but simple, repeatable, and easily understood.
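A hypothetical shape for such a Makefile (the `etc/` layout, targets, and file modes are invented for illustration; the actual FastMail tooling isn't public):

```makefile
# Deployed config lives in a git checkout on every host; each file under
# etc/ in the repo mirrors a path under /etc on the machine.
FILES := $(shell git ls-files etc/)

.PHONY: diff install

# Show what differs between the checkout and the live system.
diff:
	@for f in $(FILES); do \
		cmp -s $$f /$$f || echo "out of sync: /$$f"; \
	done

# Copy everything into place, creating directories and setting modes.
install:
	@for f in $(FILES); do \
		install -D -m 0644 $$f /$$f; \
	done
```

Running `make diff` over SSH on every host then gives a quick fleet-wide drift report, which is the "basic orchestration" part.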

As for the bigger things, software etc., we have scripts that generate Debian packages, which we store in our own private repo. You just install `fastmail-server` and the dependency management updates everything. There's a daily cronjob which checks whether there are updated security packages or things we failed to correctly deploy, and emails us as well.

It's amazing what you can build on top of the OS provided tools with not too much complexity if you don't overthink it.

> identity management (AAD/IAM)

Do you mean for administrative access to the machines (over SSH, etc) or for "normal" access to the hosted applications?

Admin access: an Ansible-managed set of UNIX users & associated SSH public keys, combined with remote logging (so every access is audited and a malicious operator wiping the machine can't cover their tracks), will generally get you pretty far. Beyond that, there are commercial solutions like Teleport which provide integration with an IdP, a management web UI, session logging & replay, etc.
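As a minimal sketch of the key-distribution idea (file layout and names are made up; a real setup would let Ansible's authorized_key module do this per host):

```shell
#!/bin/sh
# Sketch: build an authorized_keys file from per-admin public keys kept in git.
# The directory layout is hypothetical; keys here are placeholders.
set -eu

build_authorized_keys() {
  keydir="$1"
  out="$2"
  : > "$out"
  for key in "$keydir"/*.pub; do
    # prefix each key with its owner's name so audit greps are easy
    printf '# %s\n' "$(basename "$key" .pub)" >> "$out"
    cat "$key" >> "$out"
  done
}

# demo against a throwaway directory
demo=$(mktemp -d)
echo 'ssh-ed25519 AAAAC3... alice@laptop' > "$demo/alice.pub"
echo 'ssh-ed25519 AAAAC3... bob@laptop'   > "$demo/bob.pub"
build_authorized_keys "$demo" "$demo/authorized_keys"
```

Config management would then push the resulting file to each host's ~/.ssh, and removing an admin's .pub file from git offboards them on the next run.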

Normal line-of-business access: this would be managed by whatever application you're running, not much different to the cloud. But if your application isn't auth-aware or is unsafe to expose to the wider internet, you can stick it behind various auth proxies such as Pomerium - it will effectively handle auth against an IdP and only pass through traffic to the underlying app once the user is authenticated. This is also useful for isolating potentially vulnerable apps.

> provisioning and running VMs

Provisioning: once a VM (or even a physical server) is up and running enough to be SSH'd into, you should have a configuration management tool (Ansible, etc) apply whatever configuration you want. This would generally involve provisioning users, disabling some stupid defaults (SSH password authentication, etc), installing required packages, etc.

To get a VM to an SSH'able state in the first place, you can configure your hypervisor to pass through "user data" which will be picked up by something like cloud-init (integrated by most distros) and interpreted at first boot - this allows you to do things like include an initial SSH key, create a user, etc.
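For illustration, here is roughly what such a user-data file looks like (the key, username, and path are placeholders, not anything from the article); a hypervisor can hand this to the guest as a NoCloud seed:

```shell
#!/bin/sh
# Sketch: write a minimal cloud-init user-data file that creates an admin user
# and disables SSH password auth at first boot. Key and username are made up.
set -eu

userdata=$(mktemp)
cat > "$userdata" <<'EOF'
#cloud-config
users:
  - name: ops
    sudo: ALL=(ALL) NOPASSWD:ALL
    ssh_authorized_keys:
      - ssh-ed25519 AAAAC3... ops@admin-laptop
ssh_pwauth: false
package_update: true
EOF
```

Once the VM boots with this, your configuration management tool takes over via SSH for everything else.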

To run VMs on self-managed hardware: libvirt or Proxmox in the Linux world, bhyve in the BSD world. Unfortunately most of these have rough edges, so commercial solutions there are worth exploring. Alternatively, consider whether you actually need VMs or if things like containers (which have much nicer tooling and a better performance profile) would fit your use-case.

> deployments

Depends on your application. But let's assume it can fit in a container - there's nothing wrong with a systemd service that just reads a container image reference in /etc/... and uses `docker run` to run it. Your deployment task can just SSH into the server, update that reference in /etc/ and bounce the service. Evaluate Kamal which is a slightly fancier version of the above. Need more? Explore cluster managers like Hashicorp Nomad or even Kubernetes.
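A sketch of that pattern (path, unit name, and image reference are invented): the systemd unit reads an image reference from a file under /etc, and a deploy is just rewriting that file and bouncing the unit:

```shell
#!/bin/sh
# Sketch of the "image ref in /etc + bounce the service" deployment pattern.
# The real path would be something like /etc/myapp/image-ref (hypothetical).
set -eu

update_image_ref() {
  ref_file="$1"
  new_ref="$2"
  printf '%s\n' "$new_ref" > "$ref_file"
  # On a real host the deploy task would then run:
  #   ssh "$host" "sudo systemctl restart myapp.service"
  # and the unit's ExecStart would do something like:
  #   /usr/bin/docker run --rm "$(cat /etc/myapp/image-ref)"
}

# demo against a temp file instead of /etc
tmpref=$(mktemp)
update_image_ref "$tmpref" "registry.example.com/myapp:v42"
```

Rollback is the same operation with the previous reference, which is part of the appeal.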

> Network side of things like VNet

Wireguard tunnels set up (by your config management tool) between your machines, which will appear as standard network interfaces with their own (typically non-publicly-routable) IP addresses, and anything sent over them will transparently be encrypted.
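To make that concrete, here is roughly what one end of such a tunnel looks like (keys, addresses, and the endpoint are placeholders); config management would template one of these per machine and bring it up with `wg-quick up wg0`:

```shell
#!/bin/sh
# Sketch: one end of a machine-to-machine WireGuard tunnel. Written to a temp
# file here for review rather than /etc/wireguard/wg0.conf; all values are
# placeholders.
set -eu

wgconf=$(mktemp)
cat > "$wgconf" <<'EOF'
[Interface]
# private, non-publicly-routable address for this host on the mesh
Address = 10.10.0.1/24
PrivateKey = <this-host-private-key>
ListenPort = 51820

[Peer]
PublicKey = <other-host-public-key>
Endpoint = peer1.example.com:51820
# only mesh traffic goes over the tunnel
AllowedIPs = 10.10.0.2/32
PersistentKeepalive = 25
EOF
```

Internal services then bind to the 10.10.0.x addresses and never touch the public interface.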

> DNS

Generally very little reason not to outsource that to a cloud provider or even your (reputable!) domain registrar. DNS is mostly static data though, which also means if you do need to do it in-house for whatever reason, it's just a matter of getting a CoreDNS/etc container running on multiple machines (maybe even distributed across the world). But really, there's no reason not to outsource that and hosted offerings are super cheap - so go open an AWS account and configure Route53.

> securely opening ports

To begin with, you shouldn't have anything listening that you don't want to be accessible. Then it's not a matter of "opening" or closing ports - the only ports that actually listen are the ones you want open by definition because it's your application listening for outside traffic. But you can configure iptables/nftables as a second layer of defense, in case you accidentally start something that unexpectedly exposes some control socket you're not aware of.
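A sketch of such a second-layer ruleset (ports are illustrative, and it's written to a file for review rather than loaded with `nft -f`):

```shell
#!/bin/sh
# Sketch: a default-drop inbound nftables ruleset as a second layer of defense.
# The open ports are examples; adapt to whatever your app actually exposes.
set -eu

nftrules=$(mktemp)
cat > "$nftrules" <<'EOF'
table inet filter {
  chain input {
    type filter hook input priority 0; policy drop;
    ct state established,related accept
    iif lo accept
    tcp dport { 22, 80, 443 } accept
    # anything else (stray control sockets, debug ports) never gets in
  }
}
EOF
```

With policy drop as the default, a service that accidentally binds 0.0.0.0 stays unreachable until you explicitly allow it.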

> Monitoring setup across the stack

collectd running on each machine (deployed by your configuration management tool) sending metrics to a central machine. That machine runs Grafana/etc. You can also explore "modern" stuff that the cool kids play with nowadays like VictoriaMetrics, etc, but metrics is mostly a solved problem so there's nothing wrong with using old tools if they work and fit your needs.

For logs, configure rsyslogd to log to a central machine - on that one, you can have log rotation. Or look into an ELK stack. Or use a hosted service - again nothing prevents you from picking the best of cloud and bare-metal, it's not one or the other.
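The client side of central rsyslog can be a one-line drop-in (hostname and path are placeholders; `@@` means TCP, a single `@` means UDP):

```shell
#!/bin/sh
# Sketch: client-side rsyslog drop-in forwarding all logs to a central host.
# The path and hostname are hypothetical.
set -eu

rsyslogconf=$(mktemp)
cat > "$rsyslogconf" <<'EOF'
# would live at /etc/rsyslog.d/90-central.conf
*.* @@loghost.internal.example:514
EOF
```

Your config management tool drops this file on every host; the central machine then handles rotation and retention in one place.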

> safely expose an application externally

There's a lot of snake oil and fear-mongering around this. First off, you need to differentiate between vulnerabilities of your application and vulnerabilities of the underlying infrastructure/host system/etc.

App vulnerabilities, in your code or dependencies: cloud won't save you. It runs your application just like it's been told. If your app has an SQL injection vuln or one of your dependencies has an RCE, you're screwed either way. To manage this you'd do the same as you do in cloud - code reviews, pentesting, monitoring & keeping dependencies up to date, etc.

Infrastructure-level vulnerabilities: cloud providers are responsible for keeping the host OS and their provided services (load balancers, etc) up to date and secure. You can do the same. Some distros provide unattended updates (which your config management tool can enable). Stuff that doesn't need to be reachable from the internet shouldn't be (bind internal stuff to your Wireguard interfaces). Put admin stuff behind some strong auth - TLS client certificates are the gold standard but have management overheads. Otherwise, use an IdP-aware proxy (like mentioned above). Don't always trust app-level auth. Beyond that, it's the usual - common sense, monitoring for "spooky action at a distance", and luck. Not too much different from your cloud provider, because they won't compensate you either if they do get hacked.

> For more context, I worked at a very large hedge fund briefly which had a small DC worth of VERY beefy machines but absolutely no platform on top of it...

Nomad or Kubernetes.

No, using Ansible to distribute public keys does not get you very far. It's fine for a personal project or even a team of 5-6 with a handful of machines, but beyond that you really need a better way to onboard, offboard, and modify accounts. If you're doing anything but a toy project, you're better off starting with something like IPA for host access controls.

Why do you think that? I did something similar at a previous job for something bordering on 1k employees.

User administration was done by modifying a yaml file in git. Nothing bad to say about it really. It sure beats point-and-click Active Directory any day of the week. Commit log handy for audits.

If there are no externalities demanding anything else, I'd happily do it again.

There is nothing _wrong_ with it, and so long as you can prove that your offboarding is consistent and quick then feel free to use it.

But a central system that uses the same identity/auth everywhere is much easier to keep consistent and fast. That's why auditors and security professionals will harp on IdP/SSO solutions as some of the first things to invest in.

I found that the commit log made auditing on- and offboarding easier, not harder. Of course it won't help you if your process is dysfunctional. You still have to trigger the process somehow, which can be a problem in itself when growing from a startup, but once you do that it's smooth.

However git is a central system, a database if you will, where you can keep identities globally consistent. That's the whole point. In my experience, the reason people leave it is because you grow the need to interoperate with third party stuff which only supports AD or Okta or something. Should I get to grow past that phase myself I would feed my chosen IdM with that data instead.

What's the risk you're trying to protect against, that a "better" (which one?) way would mitigate that this one wouldn't?

> IPA

Do you mean https://en.wikipedia.org/wiki/FreeIPA ? That seems like a huge amalgamation of complexity in a non-memory-safe language that I feel like would introduce a much bigger security liability than the problem it's trying to solve.

I'd rather pony up the money and use Teleport at that point.

It's basically Kerberos and an LDAP server, which are technologies old and reliable as dirt.

This sort of FUD is why people needlessly spend so much money on cloud.

> which are technologies old and reliable as dirt.

Technologies, sure. Implementations? Not so much.

I can trust OpenSSH because it's deployed everywhere and I can be confident all the low-hanging fruits are gone by now, and if not, its widespreadness means I'm unlikely to be the most interesting target, so I am more likely to escape a potential zero-day unscathed.

What's the market share of IPA in comparison? Has it seen any meaningful action in the last decade, and the same attention, from both white-hats (audits, pentesting, etc) and black-hats (trying to break into every exposed service)? I very much doubt it, so the safe thing to assume is that it's nowhere near as bulletproof as OpenSSH and that it's more likely for a dedicated attacker to find a vuln there.

MIT's Kerberos 5 implementation is 30 years old and has been very widely deployed.

Aside: Fastmail was the best email provider I ever used. The interface was intuitive and responsive, both on mobile and web. They have extensive documentation for everything. I was able to set up a custom domain and a catch-all email address in a few minutes. Customer support is great, too. I emailed them about an issue and they responded within the hour (turns out it was my fault). I feel like it's a really mature product/company and they really know what they're doing, and have a plan for where they're going.

I ended up switching to Protonmail, because of privacy (Fastmail is within the Five Eyes (Australia)), which is the only thing I really like about Protonmail. But I'm considering switching back to Fastmail, because I liked it so much.

Their Android client has been less than stellar in the past but recent releases are significantly improved. Uploading files, in particular, was a crapshoot.

I also chose Proton for the same reason. It hurts that their product development is glacial, but privacy is a crucial component that I don't understand why Fastmail doesn't try to offer.

Lots of people here mentioning reasons to both use and avoid the cloud. I'll just chip in one more on the pro-cloud side: reliability at low scale.

To expand: At $dayjob we use AWS, and we have no plans to switch because we're tiny, like ~5000 DAU last I checked. Our AWS bill is <$600/mo. To get anything remotely resembling the reliability that AWS gives us we would need to spend tens of thousands up-front buying hardware, then something approximating our current AWS bill for colocation services. Or we could host fully on-prem, but then we're paying even more up-front for site-level stuff like backup generators and network multihoming.

Meanwhile, RDS (for example) has given us something like one unexplained 15-minute outage in the last six years.

Obviously every situation is unique, and what works for one won't work for another. We have no expectation of ever having to suddenly 10x our scale, for instance, because our growth is limited by other factors. But at our scale, given our business realities, I'm convinced that the cloud is the best option.

This is a common false dichotomy I see constantly: cloud vs. buying and building your own hardware from scratch and colocating/building your own datacenter.

Very few non-cloud users are buying their own hardware. You can simply rent dedicated hardware in a datacenter. For significantly cheaper than anything in the cloud. That being said, certain things like object storage, if you don't need very large amounts of data, are very handy and inexpensive from cloud services considering the redundancy and uptime they offer.

This works even at $1M/mo AWS spend. As you scale, the discounts get better. You get into the range of special pricing where they will make it work against your P&L. If you’re venture funded, they have a special arm that can do backflips for you.

I should note that Microsoft also does this.

Love this article and I'm also running some stuff on old enterprise servers in some racks somewhere. Now over the last year I've had to dive into Azure Cloud as we have customers using this (B2B company) and I finally understood why everyone is doing cloud despite the price:

Global permissions, seamless organization, and IaC. If you are Fastmail or a small startup - go buy some used Dell PowerEdge servers with Epycs in some colo rack with 10GbE transit and save tons of money.

If you are a company with tons of customers and tons of requirements, it's powerful to put each concern into a landing zone, run some Bicep/Terraform, have a resource group to control costs, get savings on overall core count, and be done with it.

Assign permissions into a namespace for your employee or customer - have some back and forth about requirements and it's done. No need to sysadmin across servers. No need to check for broken disks.

I'm also blaming the hell of VMware and virtual machines for everything that is a PITA to maintain as a sysadmin but is loved because it's common knowledge. I would only do k8s on bare metal today and skip the whole virtualization thing completely. I guess it's also these pains that are softened in the cloud.

Why is it surprising? It's well known cloud is 3 times the price.

Because the default for companies today is cloud, even though it almost never makes sense. Sure, if you have really spikey load, need to dynamically scale at any point and don't care about your spend, it might make sense.

I've even worked in companies where the engineering team spent effort and time building "scalable infrastructure" before the product itself had even found product-market fit...

Nobody said it's surprising though, they are well aware of it having done it for more than two decades. Many newcomers are not aware of it though, as their default is "cloud" and they never even shopped for servers, colocation or looked around on the dedicated server market.

I don't think it's just that they're not aware. Purely from a scaling and distribution perspective, it'd be wiser to start on cloud while you're still in the product-market-fit phase. Also, "bare metal" requires more on the capex end, and with how our corporate tax system is set up it's just discouraging to go down this lane first; it'd be better to spend on acquiring clients.

Also, I'd guess a lot of technical founders are more familiar with cloud/server-side than with handling or delegating sysadmin tasks that might require adding members to the team.

I agree, the cloud definitely has a lot of use cases and when you are building more complicated systems it makes sense to just have to do a few clicks to get a new stack setup vs. having someone evaluate solutions and getting familiar with operating them on a deep level (backups etc.).

Would be interesting to know how files get stored. They don't mention any distributed FS solutions like SeaweedFS so once a drive is full, does the file get sent to another one via some service? Also ZFS seems an odd choice since deletions (esp of small files) at +80% full drive are crazy slow.

Unlike ext4, which locks the directory when unlinking, ZFS is able to scale on parallel unlinking. Specifically, ZFS has range locks that permit directory entries to be removed in parallel from the extendible hash trees that store them. While this is relatively slow for sequential workloads, it is fast on parallel workloads. If you want to delete a large directory subtree fast on ZFS, do the rm operations in parallel. For example, this will run faster on ZFS than a naive rm operation:

  find /path/to/subtree -type f | parallel -j250 rm --
  rm -r /path/to/subtree
A friend had this issue on spinning disks the other day. I suggested he do this and the remaining files were gone in seconds when at the rate his naive rm was running, it should have taken minutes. It is a shame that rm does not implement a parallel unlink option internally (e.g. -j), which would be even faster, since it would eliminate the execve overhead and likely would eliminate some directory lookup overhead too, versus using find and parallel to run many rm processes.

For something like Fastmail, which has many users, unlinking should be parallel already, so unlinking on ZFS will not be slow for them.

By the way, that 80% figure has not been true for more than a decade. You are referring to the best fit allocator being used to minimize external fragmentation under low space conditions. The new figure is 96%. It is controlled by metaslab_df_free_pct in metaslab.c:

https://github.com/openzfs/zfs/blob/zfs-2.2.0/module/zfs/met...

Modification operations become slow when you are at/above 96% space filled, but that is to prevent even worse problems from happening. Note that my friend’s pool was below the 96% threshold when he was suffering from a slow rm -r. He just had a directory subtree with a large amount of directory entries he wanted to remove.

For what it is worth, I am the ryao listed here and I was around when the 80% to 96% change was made:

https://github.com/openzfs/zfs/graphs/contributors

I discovered this yesterday! Blew my mind. I had to check 3 times that the files were actually gone and that I specified the correct directory as I couldn't believe how quick it ran. Super cool

Unlinking gets done asynchronously on the weekends from Cyrus, using the `cyr_expire` tool. Right now it only runs one unlinking process at a time on the whole machine due to historical ext4 issues ... but maybe we should revisit that now we're on ZFS and NVMe. Thanks for the reminder.

Thank you very much for sharing this, very insightful.

Thank you for posting your original comment. The process of writing my reply gave me a flash of inspiration:

https://github.com/openzfs/zfs/pull/16896

I doubt that this will make us as fast as ext4 at unlinking files in a single thread, but it should narrow the gap somewhat. It also should make many other common operations slightly faster.

I had looked into range lock overhead years ago, but when I saw the majority of time entering range locks was spent in an “unavoidable” memory allocation, I did not feel that making the operations outside the memory allocation faster would make much difference, so I put this down. I imagine many others profiling the code came to the same conclusion. Now that the memory allocation overhead will soon be gone, additional profiling might yield further improvements. :)

The open-source Cyrus IMAP server, which they mention using, has replication built in. ZFS also has built-in replication available.

Deletion of files depends on how they have configured the message store - they may be storing a lot of data into a database, for example.

ZFS replication is quite unreliable when used with ZFS native encryption, in my experience. Didn't lose data but constant bugs.

Yeah, we're only using ZFS replication for logs; we're using the Cyrus replication for emails because it has other sanity checks and data model consistency enforcement which is really valuable.

(And both are async. We'd need something like drbd for real synchronous replication)

Keeping enough free space should be much less of a problem with SSDs. They can tune it so the array needs to be 95% full before the slower best-fit allocator kicks in. https://openzfs.readthedocs.io/en/latest/performance-tuning....

I think that 80% figure is from when drives were much smaller and finding free space over that threshold with the first-fit allocator was harder.

Emails are stored in cyrus-imapd.

For now, the "file storage" product is a Node tree in mysql, with content stored in a content-addressed blob store, which is some custom crap I wrote 15 years ago that is still going strong because it's so simple there's not much to go wrong.

We do plan to eventually move the blob storage into Cyrus as well though, because then we have a single replication and backup system rather than needing separate logic to maintain the blob store.

I was told Fastmail is excellent, and I am not a big fan of Gmail. Once locked out of Gmail for good, your email and the apps associated with it are gone forever. Source? Personal experience.

"A private inbox $60 for 12 months". I assume it is USD, not AU$ (AFAIK, Fastmail is based in Australia.) Still pricey.

At https://www.infomaniak.com/ I can buy email service for an (in my case external) domain for 18 Euro a year and I get 5 inboxes. And it is based in Switzerland, so no EU or US jurisdiction.

I have a few websites and Fastmail would just be prohibitively expensive for me.

You can have as many domains as you want for free in your Fastmail account. There are no extra fees.

I've used them for 20 years now. Highly recommended.

Wait, really? I pay for two separate domains. What am I missing?

I'm happy to pay them because I love the service (and it's convenient for taxes), but I feel like I should know how to configure multiple domains under one account.

Under Settings => Domains you can add additional domains. If you use Fastmail as domain registrar you have to pay for each additional domain, of course.

Personally I prefer Migadu and tend to recommend them to tech savvy people. Their admin panel is excellent and straightforward to use, prices are based on usage limits (amount of emails sent/received) instead of number of mailboxes.

Migadu is just all-around good; the only downsides I can find are subjective: they're based in Switzerland, and unless you're "good with computers", something like Fastmail will probably be better.

It seems Migadu is hosted on OVH though? Huge red flag... no control over their infrastructure (think of Hetzner shutting down customers with little to no warning).

My suggestion would be to try Purelymail. They don't offer much in the way of a web interface to email, but if you bring your own client, it's a very good provider.

I'm paying something like $10 per year for multiple domains with multiple email addresses (though with little traffic). I've been using them for about 5 years and I had absolutely no issues.

Purelymail is just a one-person show. May that one person live long and prosper, but I am not putting my faith or email in that business.

Why do you need to put faith in them? Switching email providers is just a DNS change away, and email messages can be stored locally - actually it's encouraged to do so.

The pricing is hard to understand: https://purelymail.com/advancedpricing

Usernames on shared domains:

1 to 6 letters: $0.20 per user per year
7 to 12 letters: $0.05 per user per year
13+ letters: $0.02 per user per year

WTF?

I have no idea why that's a thing. :D

Personally I use the simple pricing scheme and, looking at the billing page, I pay around ~$0.35-0.40 monthly for 5 domains, with 4 explicitly set email addresses and a catch-all for all domains to a common mailbox. Also I must state again, there is quite little traffic on all of them.

[deleted]

Didn't see this in the article: do they have multi-AZ redundancy? I.e. if the entire RAID goes up in flames, what's the recovery process?

Looks like they do mention that elsewhere: https://www.fastmail.com/features/reliability/

> Fastmail has some of the best uptime in the business, plus a comprehensive multi data center backup system. It starts with real-time replication to geographically dispersed data centers, with additional daily backups and checksummed copies of everything. Redundant mirrors allow us to failover a server or even entire rack in the case of hardware failure, keeping your mail running.

Yeah, that makes me feel uneasy as a long time fastmail user.

I am working on a personal project (some would call it a startup, but I have no intention of getting external financing and other Americanisms) where I have set up my own CDN and video encoding, among other things. These days, whenever you have a problem, everyone answers "just use cloud", and that results in people really knowing nothing any more. It is saddening. But on the other hand it ensures all my decades of knowledge will be very well paid in the future, if I ever need to get a job.

I like this writeup, informative and to-the-point.

Today, the cloud isn’t about other people’s hardware.

It’s about infrastructure being an API call away. Not just virtual machines but also databases, load-balancers, storage, and so on.

The cost isn't the DC or the hardware, but the hours spent on operations.

And you can abuse developers to do operations on the side :-)

And then come the weird aspects of bad cloud service providers, like IONOS: broken OS images; a provisioning API that is a bottleneck, where what other people do and how much they do can slow down your own provisioning; network interface creation that can take minutes via their API, with customer service saying "That's how it is, cannot change it."; and a very shitty web user interface that desperately tries to be a single-page app yet has all the default browser functionality like the back button broken. Yet they still cost literally 10x what Hetzner Cloud costs, while Hetzner basically does everything better.

And then it is still also about other people's hardware in addition to that.

If you don't have high bandwidth requirements, like for background / batch processing, the OVH Eco family [1] of bare-metal servers is incredibly cheap.

[1] https://eco.ovhcloud.com/en/

FYI - Fastmail web client has Offline support in beta right now.

https://www.fastmail.com/blog/offline-in-beta/

And if anyone is curious, I actually live on their https://betaapp.fastmail.com release and find it just as stable as the "mainline" one but with the advantage of getting to play with all the cool toys earlier. Bonus points (for me) in that they will periodically conduct surveys to see how you like things

Omg it's 1970 and we have IMAP now... Oh wait...

Very confused by this. What is in beta? I've had "offline" email access for 25 years. It's called an IMAP client.

[flagged]

Hey, this response makes you look like an adolescent asshole. Parent poster was clearly asking about prioritization.

[flagged]

Being an asshole is being a moron.

Obviously untrue.

I absolutely love Fastmail. I moved off of Gmail years ago with zero regrets. Better UI, better apps, better company, and need I say better service? I still maintain and fetch from a Gmail account so it all just works seamlessly for receiving and sending Gmail, so you don’t have to give anything up either.

I moved from my own colocated 1U running Mailcow to Fastmail and don't regret it one bit. This was an interesting read, glad to see they think things through nice and carefully.

The only things I wish FM had are all software:

1. A takeout-style API to let me grab a complete snapshot once a week with one call

2. The ability to be an IdP for Tailscale.

1. Hoping to have a JMAP archive format at some point, which should cover that. I'd hope that normally you'd be fetching a delta update rather than the whole thing. We've got enough bandwidth for a few people to do it, but I wouldn't want every customer pulling their entire archive every week when 99% of it is the same immutable data; that would be kinda sucky.

2. yeah, I'd love that too - we're keen to integrate with everything else that people are using. We have a basic in-house IdP thing for our own staff to authenticate against our hosted services, but haven't scaled it out. This will happen eventually, though I've been burned enough times I don't want to promise a timeframe.

I use Fastmail for my personal mail, and I don’t regret it, but I’m not quite as sold as you are, I guess maybe because I still have a few Google work accounts I need to use. Spam filtering in Fastmail is a little worse, and the search is _terrible_. The iOS app is usable but buggy. The easy masked emails are a big win though, and setting up new domains feels like less of a hassle with FM. I don’t regret using Fastmail, and I’d use them again for my personal email, but it doesn’t feel like a slam dunk.

100% this. I migrated from Gmail to Fastmail about 5 years ago and it has been rock solid. My only regret is that I didn't do it sooner.

Their UI is definitely faster but I do prefer the gmail UI, for example how new messages are displayed in threads is quite useless in fastmail.

Their android app has always been much snappier than Gmail, it's the little things that drew me to it years ago

I'm a little surprised it seems they didn't have some existing compression solution before moving to zfs. With so much repetitive text across emails I would think there would be a LOT to gain, such as from dictionaries, compressing many emails into bigger blobs, and fine-tuning compression options.

They use ZFS with zstd which likely compresses well enough.

Custom compression code can introduce bugs that can kill Fastmail's reputation of reliability.

It's better to use a well tested solution that cost a bit more.

Keen to move certain tasks to ZFS but not the ones that matter...

Frankly, given that emails are normally ~4kB objects, I suspect the compression overheads are probably not worth it unless it's for attachments only. Not attacking ZFS (its compression and checksumming are among the best in class), but the compression would work better if it weren't limited to small files. ZFS has made a lot of wins here; I've not had a problem with many files on ZFS thanks to the L1/L2 ARC, but the cost is that metadata ops can be painful on many small files.

The evidence that they are IOPS-limited is that they went for SSDs or better, when they could now store the same capacity on spinning rust for much cheaper.

Yeah, I think moving the compression or file access up a layer to abstract what is being written to disk, a la Protonmail (I don't like their offerings, but I like their tech), means you can have compression over 4MB rather than 4kB blocks, which matters when you recall data from disks for, I don't know... backups or search?

Also remember, RAID != backups ;)

I've started to host my own sites and stuff on an old MacBook in a cupboard with a shit old external hard drive and microk8s, and it's great!

Another homelabber joins the ranks!!

Just implemented a dyndns system using K8s CronJobs + GitOps + the Cloudflare Terraform provider. However, the next stage will be moving that over to Cloudflare Tunnels, which should be more reliable and nicer: fully within Terraform and not relying on polling a random JSON IP service (which is a terrifying SPOF).

What not many people talk about in the comments is how the hardware route is fairly stacked against smaller players. Large enterprises buy the same hardware as small and midsize businesses at a fraction of the cost, which significantly impacts the economics of this decision. Even if you have the capability and desire, if each server costs your business double what an enterprise would pay, it becomes less attractive pretty quickly.

> So after the success of our initial testing, we decided to go all in on ZFS for all our large data storage needs. We’ve now been using ZFS for all our email servers for over 3 years and have been very happy with it. We’ve also moved over all our database, log and backup servers to using ZFS on NVMe SSDs as well with equally good results.

If you're looking at ZFS on NVMe you may want to look at Alan Jude's talk on the topic, "Scaling ZFS for the future", from the 2024 OpenZFS User and Developer Summit:

* https://www.youtube.com/watch?v=wA6hL4opG4I

* https://openzfs.org/wiki/OpenZFS_Developer_Summit_2024

There are some bottlenecks that get in the way of getting all the performance that the hardware often is capable of.

Any ideas how they manage the ZFS encryption key? I've always wondered what you'd do in an enterprise production setting. Typing the password in at a prompt doesn't seem scalable (but maybe they have few enough servers that it's manageable), and keeping it in a file on disk or on removable storage would seem to defeat the purpose...

ZFS encryption is still corrupting datasets when using zfs send/receive for backup (which would otherwise be a huge win for mail datasets), so I'd be cautious about using it in production:

https://github.com/openzfs/zfs/issues/12014

I’ll never use ZFS in production after I was on a team that used it at petabyte scale. It’s too complex and tries to solve problems that should be solved at higher layers.

yeah, we use Cyrus replication still - it's protocol specific so it detects changes very efficiently as well, using the internal MODSEQ system also used for the JMAP /changes and IMAP CONDSTORE/QRESYNC.

Plus it has protocol consistency sanity checks built in.

Plus, I wrote it :p

Please stop using send/recv. Your backups should be based on non-ZFS tech, to avoid putting all your eggs in one basket. Yes, send/recv is fine as block-level replication for immediate (my server is now inside the tornado) recovery, but beyond that it isn't advised.

Also, who cares if a single filesystem dies, that's why you have inter-server replication. Nuke the bad server and rebuild before the next 3 or 4 die.

I think mailbox hosting is a special use case. The primary cost is storage and bandwidth and you can indeed do better on storage and bandwidth than what Amazon offers. That being said, if Fastmail asked Amazon for special pricing to make the move, they would get it.

Anyone know what are some good data centers or providers to host your bare metal servers?

You’re probably looking for the term “colo”

"WHY we use our own hardware..."

The why is is the interesting part of this article.

I take that back; this is (to me) the most interesting part:

"Although we’ve only ever used datacenter class SSDs and HDDs failures and replacements every few weeks were a regular occurrence on the old fleet of servers. Over the last 3+ years, we’ve only seen a couple of SSD failures in total across the entire upgraded fleet of servers. This is easily less than one tenth the failure rate we used to have with HDDs."

I was a bit confused by the section on backups. How do they manage moving the data offsite with the on-premises backup servers? Wouldn’t that be a cost savings by going cloud?

gmail does spam filtering very well for me. fastmail, on the other hand, puts lots of legit emails into the spam folder. manually marking "not spam" doesn't help

other than that, i'm happy with fastmail.

If I look at my Gmail spam folder, there is very rarely something genuinely important in it. What there is a fair bit of, though, is random newsletters and announcements that I may have signed up for in some way, shape, or form that I don't really care about or generally look at. I assume they've been reported as spam by enough people, rather than simply unsubscribed from, that Google now labels them as such.

iCloud is just as bad, sends important things to spam constantly and marking as “not spam” has never done anything perceivable.

>fastmail on the other hands, puts lots of legit emails into spam folder. manually marking "not spam" doesn't help

Fastmail explicitly says that moving mail to/from a spam folder via a mail client does not automatically retrain. <https://www.fastmail.help/hc/en-us/articles/1500000278142-Im...> (I never did figure out if Gmail acts the same way or not.)

Are those backups geographically distributed?

Yes.

The biggest win with running your own infra is disk/IO speeds, as noted here and in DHH's series on leaving cloud (https://world.hey.com/dhh/we-have-left-the-cloud-251760fb)

The cloud providers really kill you on IO for your VMs. Even if 'remote' SSDs are available with configurable ($$) IOPs/bandwidth limits, the size of your VM usually dictates a pitiful max IO/BW limit. In Azure, something like a 4-core 16GB RAM VM will be limited to 150MB/s across all attached disks. For most hosting tasks, you're going to hit that limit far before you max out '4 cores' of a modern CPU or 16GB of RAM.
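Some back-of-envelope arithmetic makes the gap concrete. The 150 MB/s figure is from the Azure example above; the ~3 GB/s local NVMe figure is an assumption for a single modern datacenter SSD:

```python
def scan_time_s(dataset_gib: float, throughput_mib_s: float) -> float:
    """Seconds to read a dataset sequentially at a given throughput."""
    return dataset_gib * 1024 / throughput_mib_s

# ~150 MB/s VM cap vs. an assumed ~3 GB/s local NVMe, scanning 500 GiB:
vm_capped = scan_time_s(500, 150)    # ~3413 s, close to an hour
local_nvme = scan_time_s(500, 3000)  # ~171 s, under three minutes
```

A 20x difference on a full table scan or backup job, before the VM's CPU or RAM is anywhere near saturated.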

On the other hand, if you buy a server from Dell and run your own hypervisor, you get a massive reserve of IO, especially with modern SSDs. Sure, you have to share it between your VMs, but you own all of the IO of the hardware, not some pathetic slice of it like in the cloud.

As is always said in these discussions, unless you're able to move your workload to PaaS offerings in the cloud (serverless), you're not taking advantage of what large public clouds are good at.

Biggest issue isn't even sequential speed but latency. In the cloud all persistent storage is networked and has significantly more latency than direct-attached disks. This is a physical (speed of light) limit, you can't pay your way out of it, or throw more CPU at it. This has a huge impact for certain workloads like relational databases.
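The latency effect compounds badly for serialized small IOs, which is exactly what a chatty database or SMB workload generates. The per-op figures below are illustrative (~100us direct-attached NVMe vs ~1000us for a networked cloud disk), not measurements:

```python
def chatty_workload_s(round_trips: int, latency_us: float) -> float:
    """Wall-clock seconds for `round_trips` serialized small IOs,
    each paying a fixed per-op latency."""
    return round_trips * latency_us / 1_000_000

local = chatty_workload_s(1_000_000, 100)       # 100 s
networked = chatty_workload_s(1_000_000, 1000)  # 1000 s
```

Throughput numbers on the spec sheet never show this: a million dependent 4kB reads is a latency problem, and a 10x latency penalty becomes a 10x wall-clock penalty no matter how much bandwidth you buy.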

I ran into this directly trying to use Azure's SMB as a service offering (Azure Files) for a file-based DB. It currently runs on a network share on-prem, but moving it to an Azure VM using that service killed performance. SMB is chatty as it is, and the latency of tons of small file IO was horrendous.

Interestingly, creating a file share VM deployed in the same proximity group has acceptable latency.

Yep. This is why my 12-year-old Dell R620s with Ceph on NVMe via Infiniband outperform the newest RDS and Aurora instances: the disk latency is measured in microseconds. Locally attached is of course even faster.

I don't trust anything from fastmail after they bought pobox and forced me onto their new service which fails at the one thing pobox did well--forwarding email. They also refused to give me a refund (prorated or not) for removing the product I was using and substituting a defective one.

What problems have you had? I also came over from pobox and thought that the transition was quite straightforward.

Anything erroneously marked as spam can not be released to the forwarding address, meaning they fail at their one job: forwarding email. Pobox had a great interface for quickly releasing messages to the forwarding address.

Pre-Fastmail, I did not have mail storage space at Pobox; just forwarding ability. I did not use Pobox's own interface for releasing spam mail; I used the standard filter (can't remember the exact name) and almost never saw nonspam in there (not that I checked often).

Post-Fastmail, I still forward from Pobox/Fastmail to the same other Google Workspace account from which I pull mail to my local system with `fetchmail`. I have Fastmail send all mail, spam or not; while the settings UI does not allow setting the spam protection level to "Off" when forwarding is used, the same thing can be achieved by using "Custom" then disabling "Move messages with a score of ___ or higher to Spam". I thus can let Google's spam filter deal with the inflow and, if necessary, manually sort miscategorized mail with my IMAP client.

You also terminate accounts at your sole discretion

everyone is 'cattle not pets' except the farm vet who is shoulder-deep in a cow

(my experience with managed kubernetes)

I've been doing this job for almost as long as they have. I work with companies that do on-prem, and I work with companies in the cloud, and both. Here's the low down:

1. The cost of the server is not the cost of on-prem. There are so many different kinds of costs that aren't just monetary. ("we have to do more ourselves, including planning, choosing, buying, installing, etc,") Those are tasks that require expertise (which 99% of "engineers" do not possess at more than a junior level), and time, and staff, and correct execution. They are much more expensive than you will ever imagine. Doing any of them wrong will cause issues that will eventually cost you business (customers fleeing, avoiding). That's much worse than a line-item cost.

2. You have to develop relationships for good on-prem. In order to get good service in your rack (assuming you don't hire your own cage monkey), in order to get good repair people for your hardware service accounts, in order to ensure when you order a server that it'll actually arrive, in order to ensure the DC won't fuck up the power or cooling or network, etc. This is not something you can just read reviews on. You have to actually physically and over time develop these relationships, or you will suffer.

3. What kind of load you have and how you maintain your gear is what makes a difference between being able to use one server for 10 years, and needing to buy 1 server every year. For some use cases it makes sense, for some it really doesn't.

4. Look at all the complex details mentioned in this article. These people go deep, building loads of technical expertise at the OS level, hardware level, and DC level. It takes a long time to build that expertise, and you usually cannot just hire for it, because it's generally hard to find. This company is very unique (hell, their stack is based on Perl). Your company won't be that unique, and you won't have their expertise.

5. If you hire someone who actually knows the cloud really well, and they build out your cloud env based on published well-architected standards, you gain not only the benefits of rock-solid hardware management, but benefits in security, reliability, software updates, automation, and tons of unique features like added replication, consistency, availability. You get a lot more for your money than just "managed hardware", things that you literally could never do yourself without 100 million dollars and five years, but you only pay a few bucks for it. The value in the cloud is insane.

6. Everyone does cloud costs wrong the first time. If you hire somebody who does have cloud expertise (who hopefully did the well-architected buildout above), they can save you 75% off your bill, by default, with nothing more complex than checking a box and paying some money up front (the same way you would for your on-prem server fleet). Or they can use spot instances, or serverless. If you choose software developers who care about efficiency, they too can help you save money by not needing to over-allocate resources, and right-sizing existing ones. (Remember: you'd be doing this cost and resource optimization already with on-prem to make sure you don't waste those servers you bought, and that you know how many to buy and when)

7. The major takeaway at the end of the article is "when you have the experience and the knowledge". If you don't, then attempting on-prem can end calamitously. I have seen it several times. In fact, just one week ago, a business I work for had three days of downtime, due to hardware failing, and not being able to recover it, their backup hardware failing, and there being no way to get new gear in quickly. Another business I worked for literally hired and fired four separate teams to build an on-prem OpenStack cluster, and it was the most unstable, terrible computing platform I've used, that constantly caused service outages for a large-scale distributed system.

If you're not 100% positive you have the expertise, just don't do it.

> 7. ... Another business I worked for literally hired and fired four separate teams to build an on-prem OpenStack cluster, and it was the most unstable, terrible computing platform I've used, that constantly caused service outages for a large-scale distributed system.

I've seen similarly unstable cloud systems. It's generally not the tool's fault, it's the skill of the wielder.

Yeah, we have good vendor relationships, good datacenter relationships, and we've made mis-steps along the way for sure. Own hardware isn't for everyone, but it's been great for us. YMMV

Yeah, and some people reckon web frameworks are bad too. Sometimes it might make sense to host on your own hardware, but almost certainly not for startups.

Yeah, Cloud is a bit of a scam innit? Oxide is looking more and more attractive every day as the industry corrects itself from overspending on capabilities they would never need.

It’s trading time for money

Fake news. I've got my bare metal server deployed and installed with my ansible playbook even before you manage to log into the bazillion layers of abstraction that is AWS.

But can you do that on demand, in minutes, for 1000 application teams that have unique snowflake needs? Because Terraform or Bicep can.

In multiple regions?

Yes, welcome to business. But frankly, an email provider needs to have their own metal; if they don't, they're not worth doing business with.

[dead]

[dead]

longtime FM user here

good on them, understanding infrastructure and cost/benefit is essential in any business you hope to run for the long haul

[flagged]

This last week, gmail failed to filter as spam an email with subject "#T Anitra", body,

> oF1 d 4440 - 2 B 32677 83

> R Teri E x E q

>

> k 50347733 Safoorabegum

and an attachment "7330757559.pdf". It let through 8 similar emails in the same week, and many more even more egregiously gibberish emails over the years. I'm not pleased with the quality of gmail's spam filter.

[deleted]

[dead]

I moved to FastMail three years ago, and, for a contrasting experience, found that spam filtering was almost on a par with Gmail. I had feared it would be otherwise.

my inbox at fastmail is near empty from spam. the main spam i see in my inbox is forwarded from my gmail.

That probably says more about the email address that’s out there than anything else.

Fastmail has wildcard email support, so it’s pretty easy to have an email per purchase you make (for example). This makes it easy to see who leaked your email to spammers. Anyway, I have nowhere near the volume of spam with Fastmail that I had with Gmail.

My point was not about wildcard emails, which Gmail also offers. Rather, the amount of spam you get is typically based on how well known your email address is to spammers. If someone’s not getting much spam, it usually just means they haven’t used their email address in places where they would get it. This is regardless of whether it’s a wildcard email or not.

Your comment is confusing because you start this one saying your inbox is full of spam, but respond to a suggestion to mark it as spam by saying it's not actually spam.

If something is not spam but you want it out of your inbox there's a few options:

- click Unsubscribe next to the sender. This should be possible for essentially all promotional email.

- click Actions -> click Block <sender>. Messages from this address will now immediately go to trash.

- click Actions -> click Add rule from message (-> optionally change the suggested conditions) -> check Archive (or if you don't use labels click Move to) -> click Save. Messages matching the conditions will now skip your inbox.

There's not much they could do to make that easier without magically knowing what you care about and what you don't.

I guess what's confusing is that I'm calling the promotional emails also "spam".

But thanks for your suggestions.

I see a few problems. When I receive a promotional email, I want to add a rule, and I have to click 7 times (including once for "Archive"), and use the scroll-wheel to select the "Promotions" label. Secondly, the rule is not applied directly. This is confusing, and cumbersome. Note: I don't want to Unsubscribe (because there may be vouchers), and I don't want to mark it as spam, for the same reason.

Another problem is that the amount of rules gets unwieldy this way. I have hundreds of rules already for promotional stuff and the rules I use for other (more important) stuff are hidden between them.

Maybe you think I am complaining too much, but in gmail it was all simple and automatic.

One rule that may be good, depending on how much of the mailing list email you receive is promotional, is to match "A header called List-Unsubscribe exists" and move that to Promotional. Then you could put any exceptions that it categorizes wrong above it.
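Fastmail's custom rules compile down to Sieve, so the suggested rule would look roughly like this (the "Promotions" folder name is an assumption; exceptions would be filed before it):

```sieve
require ["fileinto", "mailbox"];

# Anything carrying a List-Unsubscribe header is bulk mail of some kind;
# file it away without touching the spam classifier.
if exists "List-Unsubscribe" {
    fileinto :create "Promotions";
}
```

This keeps vouchers and newsletters out of the inbox without marking them as spam, and one rule replaces hundreds of per-sender ones.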

That's a good idea. Although, oddly, some of the emails that have an Unsubscribe button have no List-Unsubscribe header.

How would you suggest to solve the following problem: let's say I have archived all my mail (inbox zero); how do I now see the emails that are important to me (i.e., everything that was not labeled e.g. with promotions)?

Gmail puts most of my email in the spam folder, including a lot of non-spam. Manually labeling it as non-spam is not helping.

Never had that after the first few years, but I hear other people do. Maybe it's because I've used it for two decades now? I tried alternatives, including fastmail, but I always leave them because I get swamped by spam while gmail works fine.

There is a "Report Spam" function which is two clicks away (it's in the "More" menu).

I don't want to report everything as spam. For example, promotional emails from businesses that I bought something from. I don't want to punish those businesses; and those emails might contain vouchers that I could use later. But I want those emails moved out of the way without any action from my side.

That's like Spotify telling me to "keep disliking" when I complained to them about why songs in a certain language (which I never liked or listened to, and certainly don't speak) keep filling my home feed, after I had told them in the first complaint that I'd been doing that for months.

What can I say, "Report Spam" seems to work for me. I'm just a customer of Fastmail.

If you get 12 spam mails every day and after 3 months of clicking "report spam" it still doesn't filter them, then it's not on par with Gmail.

If you meet someone new at a social event and give them your email address, where do you want your email provider to put the message that this person sent?

I get no spam on fastmail. I assume this is because I never give out my email to anyone, creating new ones for every interaction instead. This way I keep track of who I'm interacting with, and also who's selling my alias emails.

Just wish there was a decent way to do this with mobile numbers!

Same, I religiously create a masked email for every website (just checked, it's now at 163!). I simply don't give my "main" email out.

Oddly enough, simply unsubscribing from the things the websites themselves send has kept things clean; I've yet to notice any true spam from a random source aimed at any of my emails since I joined last year.

I would like to know the tech stack behind it.

There's various articles on our blog about our stack!

A mail-cloud provider uses its own hardware? Well, that’s to be expected, it would be a refreshing article if it was written by one of their customers.

So they deserve praise for simply running their stuff on metal? Like a thousand unix sysadmins before and after them.

Cost isn’t always the most important metric. If that was the case, people would always buy the cheapest option of everything.

But what about the cost and complexity of a room with the racks and the cooling needs of running these machines? And the uninterrupted power setup? The wiring mess behind the racks.

There is a very competitive market for colo providers in basically every major metropolitan area in the US, Europe, and Asia. The racks, power, cooling, and network to your machines is generally very robust and clearly documented on how to connect. Deploying servers in house or in a colo is a well understood process with many experts who can help if you don’t have these skills.

Colo offers the ability to ship and deploy and keep latencies down if you're global, but if you're local yes you should just get someone on site and the modern equivalent of a T1 line setup to your premises if you're running "online" services.

I'm not fastmail but this is not rocket science. Has everyone forgotten how datacentre services work in 2024?

Yes they have, and they feel they deserve credit for discovering a WiFi cable is more reliable than the new shiny kit that was sold to them by a vendor...

Own hardware doesn't mean own data center. Many data centers offer colocation.

Even for cloud providers, these are mostly other people's problems, eg: Equinix

Do colocation facilities solve that?

We at Control Plane (https://cpln.com) make it easy to repatriate from the cloud, yet leverage the union of all the services provided by AWS, GCP and Azure. Many of our customers moved from cloud A to cloud B, and often to their own colocation cage, and in one case their own home cluster. Check out https://repatriate.cloud

Host of an online service seems to think they deserve a medal for discovering that S3 buckets from a cloud provider are crap and cost a fortune.

The heading in this space makes you think they're running custom FPGAs as with Gmail, not just running on metal... As for drive failures, welcome to storage at scale. Build your solution so it's a weekly task to replace 10 disks at a time, not a critical incident at 2am when a single disk dies...

Storing/Accessing tonnes of <4kB files is difficult, but other providers are doing this on their own metal with CEPH at the PB scale.

I love ZFS, it's great with per-disk redundancy but CEPH is really the only game in town for inter-rack/DC resilience which I would hope my email provider has.

Ceph is most certainly not the only game in town. It's good and stuff, but it's just tech. We're using protocol level replication for each of our data stores.

No, let's be honest: CEPH is the only solution for data management at this scale (sub to a few PB) that is independent of application or workload. The market share, the fact that IBM is moving people off other projects internally for this, and the massive backing show this.

Yes you can have all or a bunch of these features like failure domains via other routes/products but none have all of the stuff together in one place like CEPH.

There's a reason people call it the "Linux of storage". The only alternatives are to manage this at a higher level in your stack (reinventing the wheel) or to buy PB-level solutions from a corporate vendor, which is like saying I'm buying Oracle and MS over Linux.

Protocol replication means you've reimplemented something which is storage related elsewhere in your stack. It's not incorrect to do so, but there exist better solutions and alternatives now.

I mean, I'm happy to have this argument. CEPH is content agnostic and that's fantastic most of the time. Cyrus replication is data aware, so it's not just replicating the data, it's doing integrity checking and data model consistency handling.

Most of all, it's doing split brain recovery; which - if we wanted CP rather than AP then we wouldn't need, but that wasn't the original design.

If I was redoing this from scratch, I'd maybe do Ceph or similar and update Cyrus to work well with it, but that would be a big change from the current design.

Anyway, I'm happy to stipulate that Ceph is great tech, without going and telling other people that it's the only choice.

Do you honestly think CEPH isn't doing data consistency handling? I'll pay for your ticket to cephalocon if you'll speak to that effect(!)

Split brain stuff only happens when you're splitting a single threaded task and put it back together. MDS in CEPH has this problem but that's so far into the weeds here as to be off topic.

Again, you're implementing something storage-related elsewhere in your stack rather than in the storage layer. Fine if you want to do it that way, but talk about _that_, not hecking ZFS being mah saviour. (Btw, daily driving and loving that too, but an email provider _relying_ on it should raise eyebrows)...

I do believe we are talking past each other here. Of course ceph does data consistency, but it sure doesn't assert that a modseq is monotonically increasing or that a mailbox/uidvalidity/uid triple doesn't change digest, because it's not data-model aware.

sigh