On division: there is a paper about a division algorithm. If you split your 128-bit integer into four 32-bit integers, you can use the 64-bit CPU's built-in division instruction to do the 128-bit division. If you are on a 32-bit CPU that doesn't have a 64-bit division operation, then split your 128-bit number into eight 16-bit integers and leverage the 32-bit CPU's built-in operation. https://surface.syr.edu/eecs_techreports/166/
And having a CPU with a "128-by-64" or "64-by-32" division instruction (such as x86/x64) simplifies the algorithm even further.
Inputs:  uint[128] N, uint[128] D
Outputs: uint[128] Q, uint[128] R
         such that Q ≤ N, R < D, and N = Q * D + R
Uses:    (uint[64] quot, uint[64] rem) hw_divmod(uint[128] dividend, uint[64] divisor)

if D.high = 0:
    # long division; quotient can be 128 bits wide, but remainder will fit in 64 bits
    uint[192] N' := N ≪ lzcnt(D.low)
    uint[64]  D' := D ≪ lzcnt(D.low)
    uint[64] Q',  uint[64] R'  := hw_divmod((N'.high ≪ 64) + N'.mid, D')
    uint[64] Q'', uint[64] R'' := hw_divmod((R' ≪ 64) + N'.low, D')
    Q := (Q' ≪ 64) + Q''
    R := R'' ≫ lzcnt(D.low)
else:
    # quotient will fit in 64 bits, but remainder could be 128 bits wide
    uint[192] N' := N ≪ lzcnt(D.high)
    uint[128] D' := D ≪ lzcnt(D.high)
    uint[64] Q', uint[64] R' := hw_divmod((N'.high ≪ 64) + N'.mid, D'.high)
    uint[128] R'' := (R' ≪ 64) + N'.low
    uint[128] T   := Q' * D'.low
    if R'' < T:
        Q := Q' - 1
        R := D' + R'' - T
    else:
        Q := Q'
        R := R'' - T
    R := R ≫ lzcnt(D.high)
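For the first branch (divisor fits in 64 bits), a rough C++ rendering might look like the sketch below. This is not the article's code: it assumes an x86-64 compiler exposing the MSVC-style `_udiv128` and `_lzcnt_u64` intrinsics to play the role of hw_divmod, and the `u128` struct and function names are purely illustrative.

```
#include <intrin.h>
#include <cstdint>

struct u128 { std::uint64_t lo, hi; };   // illustrative, not the article's type

// 128-bit / 64-bit -> 128-bit quotient and 64-bit remainder. Requires d != 0.
u128 divmod_128_by_64(u128 n, std::uint64_t d, std::uint64_t* rem) {
    unsigned s = (unsigned)_lzcnt_u64(d);          // normalization shift

    // Normalize: n' = n << s (three 64-bit digits), d' = d << s (top bit set).
    std::uint64_t d_norm = d << s;
    std::uint64_t n_hi2 = s ? (n.hi >> (64 - s)) : 0;
    std::uint64_t n_hi  = (n.hi << s) | (s ? (n.lo >> (64 - s)) : 0);
    std::uint64_t n_lo  = n.lo << s;

    // Two hardware 128-by-64 divisions, exactly as in the pseudocode above.
    std::uint64_t r1, r2;
    std::uint64_t q_hi = _udiv128(n_hi2, n_hi, d_norm, &r1);
    std::uint64_t q_lo = _udiv128(r1,    n_lo, d_norm, &r2);

    *rem = r2 >> s;                                // undo the normalization
    return u128{ q_lo, q_hi };
}
```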
In high school, at age 14, I was taking a Turbo Pascal class.
The teacher had assigned us to pick a project to work on for 2 weeks.
I had picked replicating some basic dos application. I legitimately just wanted to know how it worked and get down to a lower level and reimplement it.
The teacher kept asserting "it already exists, pick something else". I was young and didn't really know how to explain it to her at the time and kept insisting on that project.
It turned into some big hoopla, dragging my parents into the principal's office, where we sat and discussed options. They had no idea what was going on. The teacher kept pushing implementing a card game, I think it was poker. I, being completely unreasonable, immediately latched onto "why does it have to be gambling? Why not something useful?"
Neither I, nor my parents had anything against a poker game morally, but that really changed the tone of the conversation and we ultimately landed on a big int implementation.
She had legitimately thought it was some gotcha and was noticeably angry when I demoed it 2 weeks later. It was of course the most naive implementation imaginable but worked well beyond the expectations assigned.
I never told her I finished it in a couple hours.
Whenever I think about it, it just kills me how dysfunctional that school was on so many levels. I suppose I was lucky, though, to have computer science classes at that age in the 90s.
Your teacher being picky about the project you picked sounds a bit ridiculous. It sounds like she was attempting to force you to work on something you weren't interested in when you could have taught yourself better by choosing your own project.
I think in the 90s a lot of schools didn't have a large pool of qualified teachers for cs classes. In my experience, "programming" from the high school level involved finding a teacher who was available, maybe the person who was responsible for the computer lab, and telling him to crack a book.
I took a CS class in high school in the 2000s, around 2005 I think. The CS "teacher" had been teaching CS there for about a decade. Sure, I was lucky to have that class at all. But she was just a math teacher; if she had taken a computer science class, it was a single one 20 years prior, in the 80s. She didn't really have any actual understanding of the concepts and skipped some major parts of computer science even at that level (no concept of structs or objects or heap/stack in a C++ class).
Wow. Glad it worked out. There were at least some good high school CS teachers in the 90s. Ours knew that a lot of kids joined the class who were burned out on school, and insisted on going beyond the curriculum to the point where those guys would have a chance at a coding job right out of school - if they put in the work. We also had to implement binary arithmetic from scratch (booleans) although I think that was in the curriculum.
I understand why a non-standard compiler-specific implementation of int128 was not used (Besides being compiler specific, the point of the article is to walk through an implementation of it), but why use
> using u64 = unsigned long long;
? Although in practice this is _usually_ an unsigned 64-bit integer, the C++ Standard does not technically guarantee this; all it says is that the type needs to be _at least_ 64 bits. [0]
I would use std::uint64_t which guarantees a type of that size, provided it is supported. [1]
Re: Multiplication: regrouping our u64 digits
I am aware more advanced and faster algorithms exist, but I wonder if something simple like Karatsuba's algorithm [2], which uses 3 multiplications instead of 4, could be a quick win for performance over the naive method used in the article. Though since it was mentioned that the compiler-specific unsigned 128-bit integers more closely resemble the ones created in the article, I suppose there must be a reason for that method to be used instead, or something I missed that makes this method unsuitable here.
Speaking of which, I would be interested to see how all these operations fare against compiler-specific implementations (as well as comparisons between different compilers) [3]. The article only briefly mentioned that their multiplication method is similar to that of the builtin `__uint128_t` [4], but did not go into detail or mention similarities/differences with their implementation of the other arithmetic operations.
[0] https://en.cppreference.com/w/cpp/language/types.html The official standard needs to be purchased, which is why I did not reference it, but this should be under the section [basic.fundamental].
[1] https://en.cppreference.com/w/cpp/types/integer.html
[2] https://en.wikipedia.org/wiki/Karatsuba_algorithm
[3] I suppose I could see for myself using godbolt, but I would like to see some commentary/discussion on this.
[4] And did not state for which compiler, though by context, I suppose it would be MSVC?
Since they don't calculate the upper 128-bits of the product, they use only 3 multiplications anyway.
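Concretely, only the low halves of the two cross products contribute to a truncated 128x128 -> 128 result. A rough sketch (not the article's code), assuming an MSVC-style `_umul128` intrinsic and an illustrative `u128` struct:

```
#include <intrin.h>
#include <cstdint>

struct u128 { std::uint64_t lo, hi; };   // illustrative type

u128 mul_truncated(u128 a, u128 b) {
    std::uint64_t hi;
    std::uint64_t lo = _umul128(a.lo, b.lo, &hi);   // full 64x64 -> 128 (multiplication #1)
    hi += a.lo * b.hi;                              // low 64 bits of a.lo * b.hi (#2)
    hi += a.hi * b.lo;                              // low 64 bits of a.hi * b.lo (#3)
    return u128{ lo, hi };                          // a.hi * b.hi only affects bits >= 128
}
```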
You are right. Not sure how I missed/forgot that. In fact, I think the entire reason I was reminded of the algorithm was because I saw the words "3 multiplications" in the article in the first place. Perhaps I need more coffee...
> I would use std::uint64_t which guarantees a type of that size, provided it is supported.
The comment on the typedef points out that the signature of the intrinsics uses `unsigned long long`, though he incorrectly states that `uint64_t` is `unsigned long` - which isn't guaranteed, as `long` is only required to be at least 32 bits and at least as large as `int`. In ILP32 and LLP64, for example, `long` is only 32 bits.
I don't think this really matters anyway. `long long` is 64-bits on pretty much everything that matters, and he is using architecture-specific intrinsics in the code so it is not going to be portable anyway.
If some future arch had 128-bit hardware integers and a data model where `long long` is 128-bits, we wouldn't need this code at all, as we would just use the hardware support for 128-bits.
But I agree that `uint64_t` is the correct type to use for the definition of `u128`, if we wanted to guarantee it occupies the same storage. The width-specific intrinsics should also use this type.
> I would be interested to see how all these operations fair against compiler-specific implementations
There's a godbolt link at the top of the article which has the comparison. The resulting assembly is basically equivalent to the built-in support.
> though he incorrectly states that `uint64_t` is `unsigned long`
It probably is; he's just probably using macOS, where both long and long long are 64-bit. https://www.intel.com/content/www/us/en/developer/articles/t...
(that's the best linkable reference I could find, unfortunately).
I've run into a similar problem where an overload for uint64_t was not selected when calling with a size_t, because one was unsigned long and the other unsigned long long; both are 64-bit uints, but according to the compiler they're different types.
This was a while ago so the details may be off, but the silly shape of the issue is correct.
> It probably is
This was my point. It may be `unsigned long` on his machine (or any that use LP64), but that isn't what `uint64_t` means. `uint64_t` means a type that is exactly 64 bits, whereas `unsigned long` is simply a type that is at least as large as `unsigned int` and at least 32 bits, and `unsigned long long` is a type that is at least as large as `unsigned long` and at least 64 bits.
I was not aware of compilers rejecting the equivalence of `long` and `long long` on LP64. GCC on Linux certainly doesn't. On windows it would be the case because it uses LLP64 where `long` is 32-bits and `long long` is 64-bits.
An intrinsic like `_addcarry_u64` should be using the `uint64_t` type, since its behavior depends on it being precisely 64 bits, which neither `long` nor `long long` guarantees. Intel's intrinsics spec defines it as using the type `unsigned __int64`, but since `__int64` is not a standard type, it has probably been implemented as a typedef or `#define __int64 long long` by the compiler or `<immintrin.h>` he is using.
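As a concrete illustration of how the intrinsic's signature drives that typedef, here is a minimal 128-bit add sketched around `_addcarry_u64`; the `u128` struct is illustrative, not the article's, and the intrinsic's operands and output parameter are declared as `unsigned long long`:

```
#include <immintrin.h>
#include <cstdint>

struct u128 { std::uint64_t lo, hi; };   // illustrative type

u128 add(u128 a, u128 b) {
    unsigned long long lo, hi;           // matches the intrinsic's declared parameter type
    unsigned char c = _addcarry_u64(0, a.lo, b.lo, &lo);   // add low halves, carry out in c
    _addcarry_u64(c, a.hi, b.hi, &hi);                     // high halves consume the carry
    return u128{ lo, hi };
}
```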
For those curious about division, I wrote a popular uint128 package in Go that uses one of the standard approaches: https://github.com/lukechampine/uint128/blob/3d2701e8e909405...
```
// this is the type used in the intrinsics signature
// (and uint64_t is unsigned long, not unsigned long long...)
using u64 = unsigned long long;
```
Can somebody explain to me why uint64_t (unsigned long according to the post) cannot be used here?
(I didn't read the article but the headline said "C++" and it did not say asm {...}, so here is a brief description of what's going on)
if you want to multiply two 8 bit numbers together, worst case (biggest numbers), you need 16 bits to hold the answer.
if you want to multiply two 16 bit numbers together, you need 32 bits to hold the answer. and so on, so you need a hardware wordsize twice as big as what you are trying to do.
if you don't have 64 bits and you decide to use 32s to mimic it, well, 32s require 64s to hold the answer, and we just agreed you don't have 64s, so you need to go down to 16 to catch the overflow into 32, so you can piecewise handle the 16-bit halves of the 32 to mimic the 64 you are trying to pretend you have. It's really just like grade-school multiplication on paper, where you piecewise multiply and add the different pieces together appropriately.
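Here is a sketch of that grade-school scheme one level up, which is the level relevant to a u128 type: a full 64x64 -> 128 multiply built from 32-bit halves, using only 64-bit arithmetic. The names are illustrative, not from the article.

```
#include <cstdint>

// Full 64x64 -> 128 product using only 64-bit arithmetic on 32-bit halves.
void mul_64x64_to_128(std::uint64_t a, std::uint64_t b,
                      std::uint64_t* hi, std::uint64_t* lo) {
    std::uint64_t a_lo = (std::uint32_t)a, a_hi = a >> 32;
    std::uint64_t b_lo = (std::uint32_t)b, b_hi = b >> 32;

    std::uint64_t p0 = a_lo * b_lo;   // bits   0..63
    std::uint64_t p1 = a_lo * b_hi;   // bits  32..95
    std::uint64_t p2 = a_hi * b_lo;   // bits  32..95
    std::uint64_t p3 = a_hi * b_hi;   // bits  64..127

    // Middle column plus the carry out of the low column; cannot overflow 64 bits.
    std::uint64_t mid = (p0 >> 32) + (std::uint32_t)p1 + (std::uint32_t)p2;

    *lo = (mid << 32) | (std::uint32_t)p0;
    *hi = p3 + (p1 >> 32) + (p2 >> 32) + (mid >> 32);
}
```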
-----
"in hardware", like inside the CPU, you can handle calculations in exactly the space you have because you have extra bit flags like "overflow" OVR and you can do shifts to handle a bit at a time, etc. or even a ton of extra bits that add up to 128 even though the results will be in 64 bit registers. at the level of C and C++ you don't have the facilities for that (a glaring omission imho, but so called smart people are dedicated to taking C even farther from being useful for what it's useful for)
GCC already has this for x64, I thought. https://gcc.gnu.org/onlinedocs/gcc/_005f_005fint128.html
RISC-V has no carry bit and this whole thing becomes awkward.
I am under the impression that boost::multiprecision has specialized templates for 128 and 256 bit math, but maybe I'm wrong. In practice when I've wanted extended precision, I've just used GMP or a language with bignums.
I would expect the best x86 machine code for many 128 bit operations would use XMM instructions, no? But I haven't investigated.
RVV defines add/sub with carry.
https://rvv-isadoc.readthedocs.io/en/latest/arith_integer.ht...
Interesting, I didn't know that. I wonder how much hassle the setup for that is. Maybe too much overhead to use for general purpose int128. But, there is also a 128-bit extension that might or might not have implementations.
>GMP
When I was a teenager I downloaded some pirated games and reverse-engineered the installer/unpacker and discovered it used UHARC, which seemed to be almost magic in how well it could compress data compared to winzip. Knowing absolutely nothing about compression algorithms or information theory, I decided I'd write my own compression algorithm, as one does.
I reasoned that since any computer value can be represented as a binary number, you could store the exponents to which you'd raise 2 and then just write those exponents to a file. Voila, lossless compression, can't believe nobody thought of this before. I used GMP so that I could "compress" arbitrarily large files (which would require handling arbitrarily large exponents).
Except of course binary bits are already optimal for dense data, so my "compression" algorithm made things orders of magnitude larger, not smaller. But it was fast!
All that to say GMP is a great library. Performant, and even an absolute moron like me could figure out how to use it in the days before ChatGPT.
Fool! You're supposed to subtract 1 from the number, and drop leading zeroes. Then you can compress the file recursively until it's 0 bytes!
I had a similar idea as a teenager - calculate md5 hash and store that plus a hint/offset to then brute force the original content. I had dial up and wanted a more practical way to get large files.
Anyway I emailed the Winrar developers about my idea and they politely explained why they didn't think it was feasible (appreciate they even took the time to respond!)
Question for those smarter than me: What is an application for an int128 type anyways? I've never personally needed it, and I laughed at RISC-V for emphasizing that early on rather than... standardizing packed SIMD.
Cryptography would be one application. Many crypto libraries use an arbitrary size `bigint` type, but the algorithms typically use modular arithmetic on some fixed width types (128-bit, 256-bit, 512-bit, or some in-between like 384-bits).
They're typically implemented with arrays of 64-bit or 32-bit unsigned integers, but if 128-bits were available in hardware, we could get a performance boost. Any arbitrary precision integer library would benefit from 128-bit hardware integers.
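As an illustration of that limb representation (a rough sketch, not taken from any particular crypto library): a 256-bit add over four little-endian 64-bit limbs with manual carry propagation.

```
#include <array>
#include <cstdint>

using u256 = std::array<std::uint64_t, 4>;   // little-endian limbs

u256 add(const u256& a, const u256& b) {
    u256 r{};
    std::uint64_t carry = 0;
    for (int i = 0; i < 4; ++i) {
        std::uint64_t t = a[i] + carry;
        std::uint64_t c1 = t < carry;        // carry from adding the incoming carry
        r[i] = t + b[i];
        carry = c1 + (r[i] < t);             // carry out of this limb (0 or 1)
    }
    return r;                                // final carry out is dropped (mod 2^256)
}
```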
I suppose that makes sense -- though SIMD seems more useful for accelerating a lot of crypto?
SIMD is for performing parallel operations on many smaller types. It can help with some cryptography, but it doesn't necessarily help when performing single arithmetic operations on larger types. Though it does help when performing logic and shift operations on larger types.
If we were performing 128-bit arithmetic in parallel over many values, then a SIMD implementation may help, but without a SIMD equivalent of `addcarry`, there's a limit to how much it can help.
Something like this could potentially be added to AVX-512 for example by utilizing the `k` mask registers for the carries.
The best we have currently is `adcx` and `adox` which let us use two interleaved addcarry chains, where one utilizes the carry flag and the other utilizes the overflow flag, which improves ILP. These instructions are quite niche but are used in bigint libraries to improve performance.
> but it doesn't necessarily help when performing single arithmetic operations on larger types.
For the curious, AFAIU the problem is the dependency chains. For example, for simple bignum addition you can't just naively perform all the adds on each limb in parallel and then apply the carries in parallel; the addition of each limb depends on the carries from the previous limbs. Working around these issues with masking and other tricks typically ends up adding too many additional operations, resulting in lower throughput than non-SIMD approaches.
There are quite a few papers on using SIMD to accelerate bignum arithmetic for single operations, but they all seem quite complicated and heavily qualified. The threshold for eking out any gain is quite high, e.g. minimum 512-bit numbers or much greater, depending. And they tend to target complex or specialized operations (not straight addition, multiplication, etc.) where clever algebraic rearrangements can profitably reorder dependency chains for SIMD specifically.
In 2024, I published a C++ proposal for a 128-bit integer type: https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2024/p31...
You can find a lot of motivation for 128-bit integers in that paper, such as fixed-point operations, implementing 128-bit (decimal) floating-point, financial calculations, cryptography, etc. However, the proposal has been superseded by P3666, which aims to bring C23's _BitInt type to C++, which wouldn't just allow for 128-bit integers (as _BitInt(128)) but for any other width as well.
I implemented a rational number library for media timestamps (think CMTime, AVRational, etc.) that uses 64-bit numerators and denominators. It uses 128-bit integers for intermediate operations when adding, subtracting, multiplying, etc. It even uses 128-bit floats (represented as 2 doubles and using double-double arithmetic[1]) for some approximation operations, and even 192-bit integers in one spot (IIRC it's multiplying a 128-bit int by a 64-bit int where I just want the high bits, so it shifts back down to 128 bits immediately after the multiplication).
I keep meaning to see if work will let me open source it.
[1]: https://en.wikipedia.org/wiki/Quadruple-precision_floating-p...
int64_t a, b, c, r;
r = (a * b) / c; /* multiplication step could overflow so use 128bits */
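For reference, the 128-bit-intermediate version being discussed might be written like this with the GCC/Clang `__int128` extension (a sketch; assumes c != 0 and that the final quotient fits back into 64 bits):

```
#include <cstdint>

// (a * b) / c with a 128-bit intermediate, so the multiplication cannot overflow.
std::int64_t muldiv(std::int64_t a, std::int64_t b, std::int64_t c) {
    return (std::int64_t)(((__int128)a * b) / c);
}
```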
Last time I checked LLVM had surprisingly bad codegen for this using int128. On x86 you only need two instructions:
__asm (
    "mulq %[multiplier]\n\t"
    "divq %[divisor]"
    : "=a"(result)
    : "a"(num), [multiplier]"r"(multiplier), [divisor]"r"(divisor)
    : "rdx"
);
The intermediate 128bit number is in rdx:rax.
That only works if you are sure to have a 64-bit result. If you can have divisor < multiplier and need to detect overflow, it's more complicated.
The last time I used one I wanted UNIX timestamps + fractional seconds. Since there was no difference between adding 1 bit or 64, I just gave it 32 bits for the fraction and 32 more bits for the integral part.
Any application which uses arithmetic on 64bit ints, because most operations can overflow. And most libs/compilers don't check for overflows.
It's used fairly frequently (e.g. in turning 64 bit division into multiplication and shifts).
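For example (an illustrative sketch using the GCC/Clang `__uint128_t` extension), unsigned division by a constant such as 3 compiles down to a multiply-high plus a shift, which is exactly a 64x64 -> 128 product:

```
#include <cstdint>

std::uint64_t div3(std::uint64_t x) {
    // 0xAAAAAAAAAAAAAAAB == ceil(2^65 / 3); (x * m) >> 65 equals x / 3 for all 64-bit x.
    return (std::uint64_t)(((__uint128_t)x * 0xAAAAAAAAAAAAAAABULL) >> 65);
}
```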
It's an opaque way to hold a GUID or an IPv6 address.
This is especially true when dealing with the UUID versions where sort order is meaningful.
I made a time sync library over a local network that had to be more precise than NTP, and used i128 to make sure the i64 math I was doing couldn't overflow.
i32 didn't cover enough time span and f64 has edge cases from the nature of floats. This was for Windows (MSVC, not GCC) so I had to roll my own i128.
Intersection calculations from computational geometry. Intersection calculations generally require about 2*n+log2(n) bits.
If you like your CAD accurate, you have to operate in integer space.
Makes me think of the bad old days where the platform gave you 8-bit ints and you built everything else yourself... or AVR-8.
I guess modern compilers (meaning anything Arduino-era and up, at least when I first got into them, maybe mid-2010s) abstract that away, because while it's true that it's doing that under the hood, we at least don't have to worry about it.
Even in the 1980s you had C compilers for micros that had bigger numbers than the platform supported. However, if you've got a small number of registers (esp on the 6502) it's not appealing to generate straightforward register-based code and you're likely to use interpreter methods like
https://en.wikipedia.org/wiki/SWEET16
On machines like the Z-80, 6809, and the 8086 you had more real compilers.
The AVR-8 is a special case because it has a huge register file (32 registers!) so if you are using 32-bit numbers you have room for 8 big registers. gcc supports 24-bit integers and those save resources taking just 3 native registers.
I am so happy that MSVC added 128 bit integers to their standard library in order to do ranges distance of uint64_t iota views. One type alias away from int128's on most machines running gcc/clang/msvc
Tangential. A long time ago at a company far, far away, this is how we did UUIDs that were made up of a TenantId and a UserId, using this exact same logic, minus the arithmetic. Great stuff.
(We wanted something UUID-like but deterministic that we could easily decompose and do RBAC with. This was prior to the invention of JWTs, OAuth, and scopes; it worked at the time.)
> we use 256-bit integers in our hot paths and go up to 564 bits for certain edge cases.
Why 564 bits? That’s 70.5 bytes.
I wonder if it was 5*64 bits that got mangled in editing. If 256 bits is sufficient for most of their code, I could see there being corner cases that need a few more bits but moving to 512 bits would be overkill.
Maybe it's a typo for 512. I'm not even sure how you would achieve 564 in this context.
It was a nice, round number.
> On division: There is no neat codegen for division.
Wait, what? I'm fairly certain that you can do a 128-bit by 128-bit division using x64's 128-bit-by-64-bit division instruction, which gives you only a 64-bit quotient and remainder. The trick is to pre-multiply both dividend and divisor by a large enough power of 2 so that the "partial" quotients and remainders that the hardware instruction needs to produce will fit into 64 bits. On the whole, IIRC, you need either 1 or 2 division instructions, depending on how large the divisor is (if it's too small, you need two divisions).
This. Quite a claim to provide insight into a 128-bit type implementation, then leave out the only non-trivial case for division/remainder. Not to mention interface design around C++ type promotion rules, which is essential if this is supposed to be a natural extension.