The Fifth Kind of Optimisation

URL: tratt.net
12 comments

I think re-evaluating the biases about what a single computer/CPU/thread can do would go a long way for a lot of developers.

Having hundreds or thousands of cores to work with is meaningless if the state between those threads is constantly in another castle. Having to go to DRAM or another CCX is a very expensive operation compared to staying within L1/L2 on a single core.

Put differently, if the problem generally has a serialized narrative of events that must be followed (e.g., how we action the current event depends on all prior events being resolved), then spreading the problem across many threads will only slow us down. Contention will immediately dominate. This is where we get things like the single-writer principle from.

There aren't very many problems that a business would pay you to write software for that are not of the serialized narrative variety. Most ordinary people aren't thinking in terms of mutexes and semaphores. They are thinking in terms of what is effectively a gigantic interpreter lock over their enterprise.
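
To make the single-writer idea concrete, here is a minimal sketch in Python (all names hypothetical): any number of producers enqueue events, but exactly one thread applies them to the shared state, in arrival order, so the state itself never needs fine-grained locking.

    import threading
    import queue

    events = queue.Queue()   # thread-safe hand-off between producers and the writer
    state = {"balance": 0}   # only the writer thread ever touches this

    def writer():
        while True:
            amount = events.get()
            if amount is None:              # shutdown sentinel
                break
            state["balance"] += amount      # serialized narrative: one event at a time

    def producer(amounts):
        for a in amounts:
            events.put(a)

    w = threading.Thread(target=writer)
    w.start()
    producers = [threading.Thread(target=producer, args=([1] * 1000,)) for _ in range(4)]
    for t in producers:
        t.start()
    for t in producers:
        t.join()
    events.put(None)
    w.join()
    print(state["balance"])   # 4000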

There’s also ‘don’t do it’ (just turn the whole thing off because it’s actually not needed) and ‘use a better middleman’ (OS, compiler, etc., while not going lower level). In my experience these are very cheap and usually not looked at, and most devs jump at the other five.

Also: buy a bigger box, or better I/O, or both. A lot of issues are solved by scaling something up. It's not more optimal, but it does solve the performance problem, and it's often cheaper than months of performance tracing and development work.

I’ve lost my voice for a few days because I spent a solid eight hours in a row talking developers out of solving already-solved problems.

“No, you don’t need Redis for caching output, we have a CDN already.”

“No, you don’t need to invent authentication tokens yourself, JWT is a downloadable package away.”

“No…”

“No…”

To be fair, JWT is abysmal in more than one respect, and there's a reasonable chance you'd be better off on your own.

It's like "No, you don't need to write a configuration parser, you can just serialize an object" which is a well travelled road straight to Dante's seventh circle.

Same. I have a team lead who's trying to externalize a SQL query to new infrastructure. Why can't we do the database call on the current machine? Why does this have to be a task for an external machine to carry out???

It's like the current machine can't wait on the database; we have to wait on an external machine that waits on the database. Why, why, why? I feel like web developers are so used to these patterns that they lose common sense.

I'm sure tons of people will agree with the team lead. I think a lot of people just don't get it.

Oh god, this.

The best part is that by externalising querying, it becomes fiddly to support every possible query shape… so…

… inevitably:

    public object GetSQL( string query ) …
SQL injection is the inevitable end result of this design pattern.
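
To spell out the failure mode, here is a minimal sketch using Python's sqlite3 as a stand-in (table and column names are hypothetical): the pass-any-string endpoint forces callers to build SQL by concatenation, whereas a parameterized query binds the value instead of parsing it as SQL.

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (id INTEGER, name TEXT)")
    conn.execute("INSERT INTO users VALUES (1, 'alice'), (2, 'bob')")

    # The GetSQL(string) shape: callers concatenate values into the query text.
    def get_sql(query):
        return conn.execute(query).fetchall()

    evil = "x' OR '1'='1"
    print(get_sql(f"SELECT * FROM users WHERE name = '{evil}'"))   # returns every row

    # Parameterized version: the value is bound, never interpreted as SQL.
    def get_user(name):
        return conn.execute("SELECT * FROM users WHERE name = ?", (name,)).fetchall()

    print(get_user(evil))   # returns []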

Also “do it in advance (and reuse)”

And batching


Every large-scale cloud/SaaS system has been using parallelism for decades [well, almost two decades, give or take]. Requests run in parallel on millions of servers, and one server processes many requests in parallel.

In that sort of setup, the old-school approach of optimizing one program by doing things in parallel is somewhat less relevant, because everything is by its nature massively parallel, and complex operations are often broken down into bits that all run in parallel. In that sort of environment, over-parallelizing can actually be a pessimization: there is often some cost to breaking the work up to run in parallel, and your little bit of this massively parallel system may only get some fraction of the resources anyway.

Not to say there's no value in knowing "old school" parallelizing (including using SIMD or GPUs or whatnot) but for many types of applications that's not where I'd focus.
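
As a rough illustration of that overhead cost (a sketch, not a benchmark; the workload is deliberately trivial), splitting tiny work items across a process pool can easily lose to doing them serially, because each item pays scheduling and serialization costs:

    import time
    from concurrent.futures import ProcessPoolExecutor

    def square(x):
        return x * x

    if __name__ == "__main__":
        xs = list(range(10_000))

        t0 = time.perf_counter()
        serial = [square(x) for x in xs]
        t1 = time.perf_counter()

        # Default chunksize of 1 means one IPC round trip per tiny task.
        with ProcessPoolExecutor(max_workers=8) as pool:
            parallel = list(pool.map(square, xs))
        t2 = time.perf_counter()

        print(f"serial:   {t1 - t0:.3f}s")
        print(f"parallel: {t2 - t1:.3f}s   (often far slower: per-item overhead dominates)")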

Somehow his taxonomy of optimizations doesn't have a category for things like "unroll a loop", "fuse loops", "block your array accesses", "avoid false sharing", or "strength-reduce your operations". He has a "use a better algorithm" category, but his example is switching from bubble sort to selection sort, a much more disruptive kind of change than such micro-optimizations, which are usually considered to be "the same algorithm".

This is rather puzzling, given that he starts off with, "Premature optimisation might be the root of all evil, but", quoting Knuth, and in context, Knuth was primarily talking about precisely such micro-optimizations: https://dl.acm.org/doi/10.1145/356635.356640 p. 267 (p. 7/41)

Specifically, the optimizations Knuth was exhibiting in that section of the paper are:

- speeding up a sequential array search by copying the search key to an empty spot at the end of the array, so that the loop only requires one termination test (are keys equal?) rather than two (are keys equal? is index in bounds?), speeding it up on typical machines of the time by 50%;

- unrolling that version of the loop 2× in a way that required gotos, speeding it up by a further 12%.

None of the optimization techniques being applied there fit into any of Tratt’s categories!
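
For concreteness, here is what the sentinel trick looks like, sketched in Python purely to show its shape; Knuth's version, and the quoted speedups, were for machine-level code of the era, where dropping the bounds test in the inner loop actually mattered:

    def find_with_sentinel(xs, key):
        xs.append(key)            # append the key as a sentinel (Knuth used an empty spot at the end)
        i = 0
        while xs[i] != key:       # one test per iteration ("are keys equal?") instead of two
            i += 1
        xs.pop()                  # remove the sentinel again
        return i if i < len(xs) else -1   # hitting the sentinel means "not found"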

Now, I can't claim to be any sort of optimization expert, but I think these kinds of micro-optimizations are still important, though, as Knuth pointed out at the time, not for all code, and often the drawbacks are not worth it. Sometimes compilers will save you. The C compiler might unroll a loop for you, for example. But often it won't, and it certainly isn't going to pull something like the sentinel trick, because it doesn't know you aren't reading the array in another thread.

This sort of depends on what context you're working in; arguably if your hotspots are in Python, even (as Tratt says) in PyPy, you have bigger opportunities for optimization available than manually unrolling a loop to reduce loop termination tests. But I've found that there's often something available in that sort of grungy tweaking; I've frequently been able to double the speed of Numpy code, for example, by specifying the output array for each Numpy operation, thus reducing the cache footprint. The code looks like shit afterwards, sadly.
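
For example, the kind of change involved looks roughly like this (a sketch with made-up sizes; the actual code isn't shown above): each Numpy operation either allocates a fresh temporary array, or writes into a preallocated buffer via the out= parameter, which keeps the working set small enough to stay in cache.

    import numpy as np

    a = np.random.rand(10_000_000)
    b = np.random.rand(10_000_000)

    # Allocating version: every operation creates a new temporary array.
    def hypot_allocating(a, b):
        return np.sqrt(a * a + b * b)

    # Reusing version: every operation writes into preallocated buffers.
    def hypot_reusing(a, b, tmp, tmp2):
        np.multiply(a, a, out=tmp)
        np.multiply(b, b, out=tmp2)
        np.add(tmp, tmp2, out=tmp)
        return np.sqrt(tmp, out=tmp)

    tmp, tmp2 = np.empty_like(a), np.empty_like(a)
    # hypot_reusing(a, b, tmp, tmp2) computes the same result with no new allocations.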

I think "micro optimizations" would be a good category to put them in (although in my experience the term is usually used in a negative way).

> The C compiler might unroll a loop for you, for example. But often it won't

I've found that clang unrolls loops really aggressively compared to GCC at -O2, to the point that I think the majority of loops get unrolled.

I’m surprised that “Add caching” is not on that list. Maybe he bundled it under one of the first four. Caching is used so extensively as an optimization method that we don’t even notice it. L1 cache, edge computing, memoization, web browser cache, and file system caches are examples of how pervasive caching is.
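
As one everyday instance of that category, memoization in Python is nearly a one-liner (the function here is a deliberately simple hypothetical example):

    from functools import lru_cache

    @lru_cache(maxsize=None)
    def fib(n):
        # Without the cache this recursion is exponential; with it,
        # each subproblem is computed once and then served from memory.
        return n if n < 2 else fib(n - 1) + fib(n - 2)

    print(fib(200))   # instant: a few hundred cached subproblems instead of ~2**200 calls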

Another kind of optimisation is to check whether you are solving an unnecessarily general problem that could be replaced by a narrower one, specific to your actual situation and workload, which might be easier to solve.

see also: Mike Acton's 2014 CppCon talk "Data-Oriented Design and C++": https://www.youtube.com/watch?v=rX0ItVEVjHc

I am often shocked at how poor language support for parallelism is. Sorbet's async/await makes writing that kind of code a breeze. Python and Java are much less ergonomic for this. It might have improved with the newest Java versions, but I have not tried it yet.
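
For reference, this is roughly what the async/await shape looks like in Python (a minimal sketch; fetch is a hypothetical stand-in for a network call):

    import asyncio

    async def fetch(item):
        await asyncio.sleep(0.1)      # stand-in for network or database latency
        return item * 2

    async def main():
        # Launch all the "requests" concurrently and collect results in order.
        results = await asyncio.gather(*(fetch(i) for i in range(10)))
        print(results)

    asyncio.run(main())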

A bunch of these dates are way wrong because the author is stuck in an x86 consumer-grade mindset. Even if we didn’t have multicore on our desktops, a lot of us had shipped on multicore long before 2014, which is when he marks the era as beginning.

Multiprocessor was already an established solution. I worked on 4-core machines around 2008, as would anyone who worked on server software instead of just desktop.

I made my ray tracer multi-threaded when I got a dual Celeron[1] box at home in 1999.

Quickly learned about race conditions...

Multi-socket was well established in the server and workstation market at that point.

[1]: https://en.wikipedia.org/wiki/ABIT_BP6

Yeah, I was working on a Windows NT image processing program in 01998. My desktop had two processors, and the machines in the dev lab and in the field had four. Mine were Pentium Pro. Some friends of mine had accounts on pioneer.unm.edu, a Sequent Symmetry with I think 8 80386 processors on a single memory bus, up until it was lost or stolen in about 01994. Then I got an account on the MHPCC SP2 UNM had the contract to administer, with I think 384 RS/6000 POWER CPUs, but without shared memory (not even NUMA).

My first Unix account was on a Sequent. I tried to put together a dual processor box in 1995 but they were having supply issues and it didn't work out.

I think it is important to differentiate multi-core from multiprocessor, but not so much in the context of that article.

We were also doing multi-threading before we had multiple cores. I remember some things breaking once we actually started seeing the first dual-core machines, probably Pentium-D circa 2005.

Massively parallel machines and parallelization predate that by a lot: https://en.wikipedia.org/wiki/Connection_Machine

I was reporting to a developer some research I'd read that said about half of errors were logical (and, or, off-by-one) and half came from misusing an API.

The developer replied "and the other half comes from concurrency".

All of us listening did the math and burst out laughing at the trenchant truth.

Some people, when confronted with a problem, think, “I know, I’ll use threads,” and then two they hav erpoblesms.

The title implies that the 5th kind of thing is revolutionary and unknown. I mean, yeah? You parallelize a task that can be parallelized and it's faster...

I thought it was going to be something unexpected. But alas, that's happening less and less on HN.

0: Make your requirements less dumb.

Let's be real, he's overblowing the novelty of parallelism. We know what parallelism is. The real "5th optimization" is quantum computing.