News | drihu.com

By kflansburg, 13 hours ago

> an if let expression over an RWLock assumed (reasonably, but incorrectly) in its else branch that the lock had been released. Instant and virulently contagious deadlock.

I believe this behavior is changing in the 2024 edition: https://doc.rust-lang.org/edition-guide/rust-2024/temporary-...

By kibwen, 13 hours ago

> I believe this behavior is changing

Past tense, the 2024 edition stabilized in (and has been the default edition for `cargo new` since) Rust 1.85.

By kflansburg, 13 hours ago

Yes, I've already performed the upgrade for my projects, but since they hit this bug, I'm guessing they haven't.

By kibwen, 12 hours ago

They may have upgraded by now, their source links to a thread from a year ago, prior to the 2024 edition, which may be when they encountered that particular bug.

By kflansburg, 12 hours ago

I see now that this incident happened in September 2024 as well.

By ricardobeat, 13 hours ago

> Like an unattended turkey deep frying on the patio, truly global distributed consensus promises deliciousness while yielding only immolation

Their writing is so good, always a fun and enlightening read.

By natebrennand, 9 hours ago

> Finally, let’s revisit that global state problem. After the contagious deadlock bug, we concluded we need to evolve past a single cluster. So we took on a project we call “regionalization”, which creates a two-level database scheme. Each region we operate in runs a Corrosion cluster with fine-grained data about every Fly Machine in the region. The global cluster then maps applications to regions, which is sufficient to make forwarding decisions at our edge proxies.

This tier approach makes a lot of sense to mitigate the scaling limit per corrosion node. Can you share how much data you wind up tracking in each tier in practice?

How concise is the entry for each application -> [regions] table? Does the constraint of running this on every node mean that this creates a global limit for number of applications? It also seems like the region level database would have a regional limit for the number of Fly machines too?

By blinkingled, 11 hours ago

> The bidding model is elegant, but it’s insufficient to route network requests. To allow an HTTP request in Tokyo to find the nearest instance in Sydney, we really do need some kind of global map of every app we host.

So is this a case of wanting to deliver a differentiating feature before the technical maturity is there and validated? It's an acceptable strategy if you are building a lesser product but if you are selling Public Cloud maybe having a better strategy than waiting for problems to crop up makes more sense? Consul, missing watchdogs, certificate expiry, CRDT back filling nullable columns - sure in a normal case these are not very unexpected or to-be-ashamed-of problems but for a product that claims to be Public Cloud you want to think of these things and address them before day 1. Cert expiry for example - you should be giving your users tools to never have a cert expire - not fixing it for your stuff after the fact! (Most CAs offer API to automate all this - no excuse for it.)

I don't mean to be dismissive or disrespectful, the problem is challenging and the work is great - merely thinking of loss of customer trust - people are never going to trust a new comer that has issues like this and for that reason move fast break things and fix when you find isn't a good fit for this kind of a product.

By tptacek, 11 hours ago

It's not a "differentiating feature"; it eliminated a scaling bottleneck. It's also a decision that long predates Corrosion.

By blinkingled, 11 hours ago

I was referring to the "HTTP request in Tokyo to find the nearest instance in Sydney" part which felt to me like a differentiating feature- no other cloud provider seems to have bidding or HTTP request level cross regional lookup or whatever.

The "decision that long predates Corrosion" is precisely the point I was trying to make - was it made too soon before understanding the ramifications and/or having a validated technical solution ready? IOW maybe the feature requiring the problem solution could have come later? (I don't know much about fly.io and its features, so apologies if some of this is unclear/wrongly assumes things.)

By tptacek, 11 hours ago

That's literally the premise of the service and always has been.

By x0x0, 10 hours ago

fwiw, I'm happily running a company and some contract work on fly literally as aws, but what if it weren't the most massively complex pile of shit you've ever seen.

I have a couple reasonably sized, understandable toml files and another 100 lines of ruby that runs long-running rake tasks as individual fly machines. The whole thing works really nicely.

By LennyHenrysNuts, 2 hours ago

I left that site after reading the first half of the first line. Transmogrifies, indeed.

By andrethegiant, 2 hours ago

What’s wrong with it? It’s a great word

By IAmGraydon, an hour ago

That says more about you than the site.

By anentropic, 10 hours ago

blog posts should have a date at the top

By chrisweekly, 9 hours ago

YES. THIS. ALWAYS!

Huge pet peeve. At least this one has a date somewhere (at the bottom, "last updated Oct 22, 2025").

By soamv, 15 hours ago

> New nullable columns are kryptonite to large Corrosion tables: cr-sqlite needs to backfill values for every row in the table

Is this a typo? Why does it backfill values for a nullable column?

By ricardobeat, 12 hours ago

It seems to be a quirk of cr-sqlite, it wants to keep track of clock values for the new column. It's not backfilling the field values as far as I understand. There is a comment mentioning it could be optimized away:

https://github.com/vlcn-io/cr-sqlite/blob/891fe9e0190dd20917...

By andrewaylett, 14 hours ago

I assume it would backfill values for any column, as a side-effect of propagating values for any column. But nullable columns are the only type you can add to a table that already contains rows, and mean that every row immediately has an update that needs to be sent.

By candiddevmike, 13 hours ago

[flagged]

By tptacek, 13 hours ago

Not a word of that article came from an LLM. You just don't like my writing.

You think an LLM would have started a sentence with "Which is why that’s how"? Only me, baby.

By sroussey, 12 hours ago

There has been a period of call-out-ai-slop-for-upvotes here for a while that people may have bots just randomly posting such accusations.

Love your response!

By conradev, 11 hours ago

  To ensure every instance arrives at the same “working set” picture, we use cr-sqlite, the CRDT SQLite extension.

Cool to see cr-sqlite used in production!

By cadamsdotcom, 4 hours ago

> for a long time we ran both Corrosion and Consul, because two distributed systems means twice the resiliency.

Nice.

By mrbluecoat, 13 hours ago

For the TL;DR folks: https://github.com/superfly/corrosion

By jimmyl02, 11 hours ago

always wondered at what scale gossip / SWIM breaks down and you need a hierarchy / partitioning. fly's use of corrosion seems to imply it's good enough for a single region which is pretty surprising because iirc Uber's ringpop was said to face problems at around 3K nodes.

it would be super cool to learn more about how the world's largest gossip systems work :)

By tptacek, 11 hours ago

SWIM is probably going to scale pretty much indefinitely. The issue we have with a single global SWIM broadcast domain isn't that the scale is breaking down; it's just that the blast radius for bugs (both in Corrosion itself, and in the services that depend on Corrosion) is too big.

We're actually keeping the global Corrosion cluster! We're just stripping most of the data out of it.

10 hours ago

[deleted]

By chucky_z, 8 hours ago

Back of napkin math I’ve done previously, it breaks down around 2 million members with Hashicorps defaults. The defaults are quite aggressive though and if you can tolerate seconds of latency (called out in the article) you could reach billions without a lot of trouble.

By tptacek, 8 hours ago

It's also frequency of changes and granularity of state, when sizing workloads. My understanding is that most Hashi shops would federate workloads of our size/global distribution; it would be weird to try to run one big cluster to capture everything.

By chucky_z, 7 hours ago

From my literal conversation I'm having right now, 'try to run one big cluster to capture everything' is our active state. I've brought up federation a bunch of times and it's fallen on deaf ears. :)

We are probably past the size of the entirety of fly.io for reference, and maintenance is very painful. It works because we are doing really strange things with Consul (batch txn cross-cluster updates of static entries) on really, really big servers (4gbps+ filesystems, 1tb memory, 100s of big and fast cores, etc).

By mosura, 11 hours ago

Someone needs to read about ant colony optimization. https://en.wikipedia.org/wiki/Ant_colony_optimization_algori...

This blog is not impressive for an infra company.

By tucnak, 11 hours ago

I respect Fly, and it does sound like a nice place to work, but honestly, you're onto something. You would expect ostensibly Public Cloud provider to have a more solid grasp on networking. Instead, we're discovering how they're learning about things like OSPF!

Makes you think that's all.

By tptacek, 10 hours ago

What a weird thing to say. I wrote my first OSPF implementation in 1999. The point is that we noticed the solution we'd settled on owes more to protocols like OSPF than to distributed consensus databases, which are the mainstream solution to this problem. It's not "OMG we just discovered this neat protocol called OSPF". We don't actually run OSPF. We don't even do a graph->tree reduction. We're routing HTTP requests, not packets.

By mosura, 10 hours ago

Look at one of the other comments:

> in case people don't read all the way to the end, the important takeaway is "you simply can't afford to do instant global state distribution"

This is what people saw as the key takeaway. If that takeaway is news to you then I don’t know what you are doing writing distributed systems.

While this message may not be what was intended it was what was broadcast.

By akerl_, 10 hours ago

It seems weird to take an inaccurate paraphrase from a commenter and then use it to paint the authors with your desired brush.

By mosura, 9 hours ago

Not sure the replies to that comment help the cause at all.

By bananapub, 14 hours ago

in case people don't read all the way to the end, the important takeaway is "you simply can't afford to do instant global state distribution" - you can formal method and Rust and test and watchdog yourself as much as you want, but you simply have to stop doing that or the unknown unknowns will just keep taking you down.

By tptacek, 13 hours ago

I mean, the thing we're saying is that instant global state with database-style consensus is unworkable. Instant state distribution though is kind of just... necessary? for a platform like ours. You bring up an app in Europe, proxies in Asia need to know about it to route to it. So you say, "ok, well, they can wait a minute to learn about the app, not the end of the world". Now: that same European instance goes down. Proxies in Asia need to know about that, right away, and this time you can't afford to wait.

By vlovich123, 10 hours ago

> Now: that same European instance goes down. Proxies in Asia need to know about that, right away, and this time you can't afford to wait.

But they have to. Physically no solution will be instantaneous because that’s not how the speed of light nor relativity works - even two events next to each other cannot find out about each other instantaneously. So then the question is “how long can I wait for this information”. And that’s the part that I feel isn’t answered - eg if the app dies, the TCP connections die and in theory that information travels as quickly as anything else you send. It’s not reliably detectable but conceivably you could have an eBPF program monitoring death and notifying the proxies. Thats the part that’s really not explained in the article which is why you need to maintain an eventually consistent view of the connectivity. I get maybe why that could be useful but noticing app connectivity death seems wrong considering I believe you’re more tracking machine and cluster health right? Ie not noticing an app instance goes down but noticing all app instances on a given machine are gone and consensus deciding globally where the new app instance will be as quickly as possible?

By tptacek, 10 hours ago

A request routed to a dead instance doesn't fall into a black hole: our proxies reroute it. But that's very slow; to deliver acceptable service quality you need to minimize the number of times that happens. So you can't accept a solution that leaves large windows of time within which every instance that has gone down has a stale entry. Remember: instances coming up and down happens all the time on this platform! It's part of the point.

By __turbobrew__, 11 hours ago

> Proxies in Asia need to know about that, right away, and this time you can't afford to wait.

Did you ever consider envoy xDS?

There are a lot of really cool things in envoy like outlier detection, circuit breakers, load shedding, etc…

By tptacek, 11 hours ago

Nope. Talk a little about how how Envoy's service discovery would scale to millions of apps in a global network? There's no way we found the only possible point in the solution space. Do they do something clever here?

What we (think we) know won't work is a topologically centralized database that uses distributed consensus algorithms to synchronize. Running consensus transcontinentally is very painful, and keep the servers central, so that update proposals are local and the protocol can run quickly, subjects large portions of the network to partition risk. The natural response (what I think a lot of people do, in fact) is just to run multiple consensus clusters, but our UX includes a global namespace for customer workloads.

By __turbobrew__, 11 hours ago

I haven’t personally worked on envoy xds, but it is what I have seen several BigCo’s use for routing from the edge to internal applications.

> Running consensus transcontinentally is very painful

You don’t necessarily have to do that, you can keep your quorum nodes (lets assume we are talking about etcd) far enough apart to be in separate failure domains (fires, power loss, natural disasters) but close enough that network latency isn’t unbearably high between the replicas.

I have seen the following scheme work for millions of workloads:

1. Etcd quorum across 3 close, but independent regions

2. On startup, the app registers itself under a prefix that all other app replicas register

3. All clients to that app issue etcd watches for that prefix and almost instantly will be notified when there is a change. This is baked as a plugin within grpc clients.

4. A custom grpc resolver is used to do lookups by service name

By tptacek, 11 hours ago

I'm thrilled to have people digging into this, because I think it's a super interesting problem, but: no, keeping quorum nodes close-enough-but-not-too-close doesn't solve our problem, because we support a unified customer namespace that runs from Tokyo to Sydney to São Paulo to Northern Virginia to London to Frankfurt to Johannesburg.

Two other details that are super important here:

This is a public cloud. There is no real correlation between apps/regions and clients. Clients are public Internet users. When you bring an app up, it just needs to work, for completely random browsers on completely random continents. Users can and do move their instances (or, more likely, reallocate instances) between regions with no notice.

The second detail is that no matter what DX compromise you make to scale global consensus up, you still need reliable realtime update of instances going down. Not knowing about a new instance that just came up isn't that big a deal! You just get less optimal routing for the request. Not knowing that an instance went down is a very big deal: you end up routing requests to dead instances.

The deployment strategy you're describing is in fact what we used to do! We had a Consul cluster in North America and ran the global network off it.

By __turbobrew__, 10 hours ago

> I'm thrilled to have people digging into this, because I think it's a super interesting problem

Yes, somehow this is a problem all the big companies have, but it seems like there is no standard solution and nobody has open sourced their stuff (except you)!

Taking a step back, and thinking about the AWS outage last week which was caused by a buggy bespoke system built on top of DNS, it seems like we need an IETF standard for service discovery. DNS++ if you will. I have seen lots of (ab)use of DNS for dynamic service discovery and it seems like we need a better solution which is either push based or gossip based to more quickly disseminate service discovery updates.

By otterley, 6 hours ago

I work for AWS; opinions are my own and I’m not affiliated with the service team in question.

That a DNS record was deleted is tangential to the proximate cause of the incident. It was a latent bug in the control plane that updated the records, not the data plane. If the discovery protocol were DNS++ or /etc/hosts files, the same problem could have happened.

DNS has a lot of advantages: it’s a dirt cheap protocol to serve (both in terms of bytes over the wire and CPU utilization), is reasonably flexible (new RR types are added as needs warrant), isn’t filtered by middleboxes, has separate positive and negative caching, and server implementations are very robust. If you’re doing to replace DNS, you’re going to have a steep hill to climb.

By __turbobrew__, 3 hours ago

> It was a latent bug in the control plane that updated the records, not the data plane

Yes, I know that. But part of the issue is that the control plane exists in the first place to smooth the impedance mismatch between DNS and how dynamic service discovery works in practice. If we had a protocol which better handled dynamic service discovery, the control plane would be much less complex and less prone to bugs.

As far as I have seen, most cloud providers internally use their own service discovery systems and then layer dns on top of that system for third party clients to access. For example, DynamoDB is registered inside of AWS internal service discovery systems, and then the control plane is responsible for reconciling the service discovery state into DNS (the part which had a bug). If instead we have a standard protocol for service discovery, you can drop that in place of the AWS internal service discovery system and then clients (both internal and external) can directly resolve the DynamoDB backends without needing a DNS intermediary.

I don’t know how AWS or DynamoDB works in practice, but I have worked at other hyperscalers where a similar setup exists (DNS is layered on top of some internal service discovery system).

> If you’re doing to replace DNS, you’re going to have a steep hill to climb.

Yes, no doubt. But as we have seen with wireguard, if there is a good idea that has merit it can be quickly adopted into a wide range of operating systems and libraries.

By otterley, 3 hours ago

> If instead we have a standard protocol for service discovery, you can drop [reconciliation] in place of the AWS internal service discovery system and then clients (both internal and external) can directly resolve the DynamoDB backends without needing a DNS intermediary.

DNS is a service discovery protocol! And a rather robust one, too. Don’t forget that.

AWS doesn’t want to expose to the customer all the dirty details of how internal routing is done. They want to publish a single regional service endpoint, put a SLO on it, and handle all the complexity themselves. Saving unnecessary complexity from customers is, after all, one of the key value propositions of a managed service. It also allows the service provider the flexibility to change the underlying implementation without impacting customer clients.

I’m not sure the best response to “the reconciler had a bug, and other reconcilers might, too” is to replace it with an entirely new and untested service discovery protocol. A proposed compensating control to this bug might be as simple as “if the result would be to delete the zone or empty it of all RRs, halt and page the on-call.” Fail open, as it were.

Also, anyone proposing a new protocol in response to a problem—especially one that had nothing to do with the protocol itself—should probably be burdened with defining and implementing its replacement. ;)

By __turbobrew__, an hour ago

> I’m not sure the best response to “the reconciler had a bug, and other reconcilers might, too” is to replace it with an entirely new and untested service discovery protocol

That is not what I am proposing. The current state is that there are two reconcilers (DNS and internal service discovery) and collapsing those into one reconciler protocol will simplify the system.

> especially one that had nothing to do with the protocol itself

Part of the problem is the increased system complexity by layering multiple service discovery systems on top of each other.

> A proposed compensating control to this bug might be as simple as “if the result would be to delete the zone or empty it of all RRs, halt and page the on-call.”

You cannot pre-emptively predict all possible bugs and race conditions. How can I create alerts for all of the failure conditions I have not thought of? A better assumption is that all systems will fail, and one of the things you can do to reduce failure rate is to simplify the system. Additionally, you can segment the system into shards/cells and roll out config and code changes serially to each cell to catch issues before they affect 100% of customers.

I am not hand waving or yelling at the clouds here. I have worked on service discovery for hyperscalars and have witnessed similar outages where the impedance mismatch between internal service discovery and DNS causes issues.

By otterley, 38 minutes ago

> You cannot pre-emptively predict all possible bugs and race conditions. How can I create alerts for all of the failure conditions I have not thought of?

You can’t. That’s just life. The electrical and building codes didn’t start as thousand-page tomes, but as we gained experience over the course of countless incidents, the industry recorded those lessons as prescriptions. Every rule was written in blood, as they say, and now practitioners are bound to follow them. We don’t have the same regulatory framework to ensure we build resilient services, but on the other hand, nobody has died or been seriously injured as a consequence of an internet service failure.

> A better assumption is that all systems will fail, and one of the things you can do to reduce failure rate is to simplify the system.

Why not do both? However, some systems have irreducible complexity for good reason, and it is better to see whether that is in fact the case before proposing armchair prescriptions.

> Additionally, you can segment the system into shards/cells and roll out config and code changes serially to each cell to catch issues before they affect 100% of customers.

I was formerly the lead of the AWS Well-Architected reliability pillar. You’re describing an AWS design and operating principle, and many services do just that (I’m not sure about DynamoDB but it would surprise me if they didn’t). However, at the end of the day, there is a single regional service endpoint customers use.

> I am not hand waving or yelling at the clouds here. I have worked on service discovery for hyperscalars and have witnessed similar outages where the impedance mismatch between internal service discovery and DNS causes issues.

Nobody is accusing you of such behavior, but you also haven’t proposed a concretely better solution, and the one you have mentioned in other replies (Envoy xDS) isn’t built for purpose. It might work fine in the context of a Kubernetes cluster, but it’s certainly not appropriate for Internet-scale service discovery or the planetary scale edge service fabric that fly.io is building.

By tptacek, 4 hours ago

I'm nodding my head to this but have to call out that DNS with "interesting" RRs is extensively filtered by middleboxes --- just none of the middleboxes AWS would deploy or allow to be deployed anywhere it peers.

By __turbobrew__, 10 hours ago

> you still need reliable realtime update of instances going down

The way I have seen this implemented is through a cluster of service watcher that ping all services once every X seconds and deregister the service when the pings fail.

Additionally you can use grpc with keepalives which will detect on the client side when a service goes down and automatically remove it from the subset. Grpc also has client side outlier detection so the clients can also automatically remove slow servers from the subset as well. This only works for grpc though, so not generally useful if you are creating a cloud for HTTP servers…

By tptacek, 10 hours ago

Detecting that the service went down is easy. Notifying every proxy in the fleet that it's down is not. Every proxy in the fleet cannot directly probe every application on the platform.

By __turbobrew__, 7 hours ago

I believe it is possible within envoy to detect a bad backend and automatically remove it from the load balancing pool, so why can the proxy not determine that certain backend instances are unavailable and remove them from the pool? No coordination needed and it also handles other cases where the backend is bad such as overload or deadlock?

It also seems like part of your pain point is that there is an any-to-any relationship between proxy and backend, but that doesn’t need to be the case necessarily, cell based architecture with shuffle sharding of backends between cells can help alleviate that fundamental pain. Part of the advantage of this is that config and code changes can then be rolled out cell by cell which is much safer as if your code/configs cause a fault in a cell it will only affect a subset of infrastructure. And if you did shuffle sharding correctly, it should have a negligible affect when a single cell goes down.

By tptacek, 6 hours ago

Ok, again: this isn't a cluster of load balancers in front of a discrete collection of app servers in a data center. It's thousands of load balancers handling millions of applications scattered all over the world, with instances going up and down constantly.

The interesting part of this problem isn't noticing that an instance is down. Any load balancer can do that. The interesting problem is noticing than and then informing every proxy in the world.

I feel like a lot of what's happening in these threads is people using a mental model that they'd use for hosting one application globally, or, if not one, then a collection of applications they manage. These are customer applications. We can't assume anything about their request semantics.

By __turbobrew__, 5 hours ago

> The interesting problem is noticing than and then informing every proxy in the world.

Yes and that is why I suggested why your any-to-any relationship of proxy to application is a decision you have made which is part of the painpoint that caused you to come up with this solution. The fact that any proxy box can proxy to any backend is a choice which was made which created the structure and mental model you are working within. You could batch your proxies into say 1024 cells and then assign a customer app to say 4/1024 cells using shuffle sharding. Then that decomposes the problem into maintaining state within a cell instead of globally.

Im not saying what you did was wrong or dumb, I am saying you are working within a framework that maybe you are not even consciously aware of.

By tptacek, 5 hours ago

Again: it's the premise of the platform. If you're saying "you picked a hard problem to work on", I guess I agree.

We cannot in fact assign our customers apps to 0.3% of our proxies! When you deploy an app in Chicago on Fly.io, it has to work from a Sydney edge. I mean, that's part of the DX; there are deeper reasons why it would have to work that way (due to BGP4), but we don't even get there before becoming a different platform.

By __turbobrew__, 4 hours ago

I think the impedance mismatch here is I am assuming we are talking about a hyperscaler cloud where it would be reasonable to have say 1024 proxies per region. Each app would be assigned to 4/1024 proxies in each region.

I have no idea how big of a compute footprint fly.io is, and maybe due to that the design I am suggesting makes no sense for you.

By tptacek, 4 hours ago

The design you are suggesting makes no sense for us. That's OK! It's an interesting conversation. But no, you can't fix the problem we're trying to solve with shuffle shard.

By otterley, 6 hours ago

Out of curiosity, what’s your upper bound latency SLO for propagating this state? (I assume this actually conforms to a percentile histogram and isn’t a single value.)

By JoachimSchipper, 7 hours ago

(Hopping in here because the discussion is interesting... feel very free to ignore.)

Thanks for writing this up! It was a very interesting read about a part of networking that I don't get to seriously touch.

That said: I'm sure you guys have thought about this a lot and that I'm just missing something, but "why can't every proxy probe every [worker, not application]?" was exactly one of the questions I had while reading.

Having the workers being the source-of-truth about applications is a nicely resilient design, and bruteforcing the problem by having, say 10k proxies each retrieve the state of 10k workers every second... may not be obviously impossible? Somewhat similar to sending/serving 10k DNS requests/s/worker? That's not trivial, but maybe not _that_ hard? (You've been working on modern Linux servers a lot more than I, but I'm thinking of e.g. https://blog.cloudflare.com/how-to-receive-a-million-packets...)

I did notice the sentence about "saturating our uplinks", but... assuming 1KB=8Kb of compressed critical state per worker, you'd end up with a peak bandwidth demand of about 80 Mbps of data per worker / per proxy; that may not be obviously impossible? (One could reduce _average_ bandwidth a lot by having the proxies mostly send some kind of "send changes since <...>" or "send all data unless its hash is <...>" query.)

(Obviously, bruteforcing the routing table does not get you out of doing _something_ more clever than that to tell the proxies about new workers joining/leaving the pool, and probably a hundred other tasks that I'm missing; but, as you imply, not all tasks are equally timing-critical.)

The other question I had while reading was why you need one failure/replication domain (originally, one global; soon, one per-region); if you shard worker state over 100 gossip (SWIM Corrosion) instances, obviously your proxies do need to join every sharded instance to build the global routing table - but bugs in replication per se should only take down 1/100th of your fleet, which would hit fewer customers (and, depending on the exact bug, may mean that customers with some redundancy and/or autoscaling stay up.) This wouldn't have helped in your exact case - perfectly replicating something that takes down your proxies - but might make a crash-stop of your consensus-ish protocol more tolerable?

Both of the questions above might lead to a less convenient programming model, which be enough reason on its own to scupper it; an article isn't necessarily improved by discussing every possible alternative; and again, I'm sure you guys have thought about this a lot more than I did (and/or that I got a couple of things embarassingly wrong). But, well, if you happen to be willing to entertain my questions I would appreciate it!

By DAlperin, 7 hours ago

(I used to work at Fly, specifically on the proxy so my info may be slightly out of date, but I've spent a lot of time thinking about this stuff.)

> why can't every proxy probe every [worker, not application]?

There are several divergent issues with this approach (though it can have it's place). First, you still need _some_ service discovery to tell you where the nodes are, though it's easy to assume this can be solved via some consul-esque system. Secondly, there is a lot more data than you might be thinking at play here. A single proxy/host might have many thousands of VMs under its purview. That works out to a lot of data. As you point out there are ways to solve this:

> One could reduce _average_ bandwidth a lot by having the proxies mostly send some kind of "send changes since <...>" or "send all data unless its hash is <...>" query.

This is definitely an improvement. But we have a new issue. Lets say I have proxies A, B, and C. A and C lose connectivity. Optimally (and in fact fly has several mechanisms for this) A could send it's traffic to C via B. But in this case it might not even know that there is a VM candidate on C at all! It wasn't able to sync data for a while.

There are ways to solve this! We could make it possible for proxies to relay each others state. To recap: - We have workers that poll each other - They exchange diffs rather than the full state - The state diffs can be relayed by other proxies

We have in practice invented something quite close to a gossip protocol! If we continued drawing the rest of the owl you might end up with something like SWIM.

As far as your second question I think you kinda got it exactly. A crash of a single corrosion does not generally affect anything else. But if something bad is replicated, or there is a gossip storm, isolating that failure is important.

By JoachimSchipper, 6 hours ago

Thanks a lot for your response!

By tptacek, 7 hours ago

Hold up, I sniped Dov into answering this instead of me. :)

By justinparus, 9 hours ago

The solutions across different BigCorp Clouds varies depending on the SLA from their underlying network. Doing this on top the public internet is very different than on redundant subsea fiber with dedicated BigCorp bandwidth!

By otterley, 6 hours ago

Lots of solutions appear to work in a steady-state scenario—which, admittedly, is most of the time. The key question is how resilient to failure they are, not just under blackout conditions but brownouts as well.

Many people will read a comment like this and cargo-cult an implementation (“millions of workloads”, you say?!) without knowing how they are going to handle the many different failure modes that can result, or even at what scale the solution will break down. Then, when the inevitable happens, panic and potentially data loss will ensue. Or, the system will eventually reach scaling limits that will require a significant architectural overhaul to solve.

TL;DR: There isn’t a one-size-fits-all solution for most distributed consensus problems, especially ones that require global consistency and fault tolerance, and on top of that have established upper bounds on information propagation latency.

By hedgehog, 11 hours ago

Is it actually necessary to run transcontinental consensus? Apps in a given location are not movable so it would seem for a given app it's known which part of the network writes can come from. That would require partitioning the namespace but, given that apps are not movable, does that matter? It feel like there are other areas like docs and tooling that would benefit from relatively higher prioritization.

By tptacek, 11 hours ago

Apps in a given location are extremely movable! That's the point of the service!

By hedgehog, 10 hours ago

We unfortunately lost our location with not a whole lot of notice and the migration to a new one was not seamless, on top of things like the GitHub actions being out of date (only supporting the deprecated Postgres service, not the new one).

By nodesocket, 11 hours ago

Anybody used rqlite[1] in production? I'm exploring how to make my application fault-tolerant using multiple app vm instances. The problem of course is the SQLite database on disk. Using a network file system like NFS is a no-go with SQLite (this includes Amazon Elastic File System (EFS)).

I was thinking I'll just have to bite the bullet and migrate to PostgreSQL, but perhaps rqlite can work.

[1] https://rqlite.io

By otoolep, 10 hours ago

rqlite creator here. Right there on the rqlite homepage[1] are listed two production users: replicated.com[2] and textgroove.com are both using it.

[1] https://rqlite.io/

[2] https://www.replicated.com/blog/app-manager-with-rqlite

By tucnak, 11 hours ago

What's this obsession with SQLite? For all intents and purposes, what they'd accomplished is effectively a Type 2 table with extra steps. CRDT is totally overkill in this situation. You can implement this in Postgres easily with very little changes to your access patterns... DISTINCT ON. Maybe this kind of "solution" is impressive for Rust programmers, I'm not sure what's the deal exactly, but all it tells me is Fly ought to hire actual networking professionals, maybe even compute-in-network guys with FPGA experience like everyone else, and develop their own routers that way—if only to learn more about networking.

By tptacek, 11 hours ago

What part of this problem do you think FPGAs would help with?

In what sense do you think we need specialty routers?

How would you deploy Postgres to address these problems?

By tucnak, 10 hours ago

[flagged]

By DAlperin, 9 hours ago

(I used to work at fly on networking)

Fly has a lot of interesting networking issues but I don't know that like, the actual routing of packets is the big one? And even in the places where there is bottlenecks in the overlay mesh I'm not sure that custom FPGAs are going to be the solution for now.

But also this blog post isn't about routing packets, it's about state tracking so we know _where_ to even send our packets in the first place.

By tptacek, 10 hours ago

I'm not sure you understand our problem space.

By throwaway290, 15 hours ago

I guess all designers at fly were replaced by ai because this article is using gray bold font for the whole text. I remember these guys had good blog some time ago

By tptacek, 12 hours ago

The design hasn't changed in years. If someone has a screenshot and a browser version we can try to figure out why it's coming out fucky for you.

By kg, 12 hours ago

Looking at the css, there's a .text-gray-600 CSS style that would cause this, and it's overridden by some other style in order to achieve the actual desired appearance. Maybe the override style isn't loading - perhaps the GP has javascript disabled?

By tptacek, 11 hours ago

Thanks! Relayed.

By throwaway290, 8 hours ago

javascript is enabled but I don't see the problem on another phone, so yeah seems related

By dewey, 15 hours ago

Not sure if that was changed since then, but it's not bold for me and also readable. Maybe browser rendering?

By ceigey, 14 hours ago

Also not bold for me (Safari). Variable font rendering issue?

By throwaway290, 13 hours ago

stock safari on ios 26 for me. is it another of 37366153 regressions of ios 26?

By iviv, 12 hours ago

Looks normal to me on iOS 26.0.1

By throwaway290, 13 hours ago

stock safari on ios

and I think the intended webfont is loaded because the font is clearly weird ish and non-standard and the text is invisible for good 2 seconds at first while it loads:)

By mcny, 13 hours ago

Please try the article mode in your web browser. Firefox has a pretty good one but I understand all major browsers have this now.

By throwaway290, 13 hours ago

I only use article mode in exceptional cases. I hold fly to higher standard than that.

By tptacek, 12 hours ago

D'awwwwww.

By jjtheblunt, 12 hours ago

latest macos firefox and safari both show grey on white, legible but contrast somewhat lacking, but rendered properly for grey on white.

By foofoo12, 15 hours ago

It's totally unreadable.

By davidham, 13 hours ago

Looks like it always has, to me.