Interested to hear what @mzero has to say
Well, the first thing I’m going to say is don’t try to do math with “AI”. Large Language Models are trained to produce “reasonable textual responses”, not to compute values or deliver facts. One of those answers may be right… so why bother with the AI if you want the actual answer, since you’ll have to check against an actual source of information on the subject to be sure?
Now, considering only the 32-bit hash, the probability of a collision in 2000 samples is:
h = 2 ^ 32 -- number of possible hashes
k = 2000 -- number of elements
u = h! / (h - k)! -- number of ways to have k unique hashes
a = h ^ k -- number of ways to have k hashes (unique or not)
q = u / a -- probability of having all unique hashes in a random group of k elements
p = 1 - q -- probability of having a collision in a random group of k elements
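(If you want the exact number, you don’t actually have to evaluate those factorials: the ratio u / a collapses into a running product of “still unique” terms, (h/h) * ((h-1)/h) * … * ((h-k+1)/h), which a machine will happily grind through. A quick sketch in Haskell, with names of my own invention:)

```haskell
-- Sketch: the exact collision probability, computed as a running product so we
-- never have to build h! (throwaway names, just for illustration).
qUnique :: Double -> Int -> Double
qUnique h k = product [ (h - fromIntegral i) / h | i <- [0 .. k - 1] ]

pExact :: Double -> Int -> Double
pExact h k = 1 - qUnique h k

-- In GHCi:  pExact (2^32) 2000  ≈ 4.65e-4   -- about 1 in 2150
```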
These numbers are big (it’s those factorials), and direct calculation of the factorial form isn’t generally possible; hence the product trick above, or, for pen-and-paper work, an approximation. The following is good enough for our purposes:
p ≈ 1 - e ^ -( k^2 / (2*h) ) -- useful if k << h, which it is
p ≈ 0.000466 ≈ 1 in 2148
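(That formula is trivial to script if you want to play with other hash widths or sample counts; here’s a tiny helper, again with my own naming:)

```haskell
-- Sketch: the same approximation, parameterized over hash width in bits.
pApprox :: Int -> Double -> Double
pApprox bits k = 1 - exp (negate (k * k / (2 * h)))
  where h = 2 ** fromIntegral bits

-- pApprox 32 2000 ≈ 4.66e-4  -- about 1 in 2148
-- pApprox 40 2000 ≈ 1.82e-6  -- about 1 in 550,000 (the 40-bit figure below)
-- pApprox 32 73   ≈ 6.20e-7  -- about 1 in 1.6 million (the 73-sample case below)
```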
Diversion: Google’s AI is bullshitting its way through, and its conclusion is very misleading. ChatGPT has gotten this one correct, but don’t expect less bullshit from it in general just because it got this one right.
Now, what does this mean in practice? Let’s spell it out:
If you load a Digitakt with 2000 random samples, all the exact same length, then you have about a 1 in 2000 chance that there is a hash collision between two samples.
-or-
You would have to make about 1,500 collections of 2000 random samples, all the exact same length, before you had a 50% chance of seeing a collision in one of those sets.
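(Where does that 1,500 come from? Compounding the per-collection chance over many independent collections; a quick sketch:)

```haskell
-- Sketch: chance of at least one collision across n independent collections,
-- each with per-collection collision probability p.
pAcrossSets :: Int -> Double -> Double
pAcrossSets n p = 1 - (1 - p) ^ n

-- pAcrossSets 1500 (1 / 2148) ≈ 0.50
```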
Okay, I get that to a casual eye, this looks like it’s within the realm of possibility. I admit that, as an engineer, I would have done this calc and then increased the hash size used by the system. (For example, just increasing to 40 bits makes p ≈ 1 in 550,000.)
But remember that we are only talking about samples of exactly the same size. I did a search over my entire collection of samples (though I’m sure some of you have larger ones); here are the top ten most-repeated sample sizes (a script for running the same census yourself is a little further down):
 n  | size of sample
 73 | 133760
 64 | 209024
 64 | 150656
 46 | 2648192
 32 | 530048
 32 | 464000
 32 | 331904
 30 | 2119808
 16 | 79424
 16 | 767384
Let’s take 73 as the value of k. The chance that these samples have a collision is:
p ≈ 1 - e ^ -( 73^2 / (2 * 2^32) )
p ≈ 0.00000062038 or 1 in 1.6 million
I think we’re in generally safe territory…
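(For the curious: if you want to run the same size census on your own collection, here is a rough equivalent in Haskell. The "samples" folder name is just a placeholder, and this is a sketch, not production code:)

```haskell
-- Sketch: tally file sizes under a folder and print the ten most-repeated ones.
import Control.Monad (forM)
import Data.List (sortBy)
import Data.Ord (Down (..), comparing)
import qualified Data.Map.Strict as Map
import System.Directory (doesDirectoryExist, getFileSize, listDirectory)
import System.FilePath ((</>))

-- recursively collect every file path under a directory
allFiles :: FilePath -> IO [FilePath]
allFiles dir = do
  entries <- listDirectory dir
  paths <- forM entries $ \entry -> do
    let path = dir </> entry
    isDir <- doesDirectoryExist path
    if isDir then allFiles path else pure [path]
  pure (concat paths)

main :: IO ()
main = do
  sizes <- mapM getFileSize =<< allFiles "samples"   -- placeholder folder name
  let counts = Map.toList (Map.fromListWith (+) [ (s, 1 :: Int) | s <- sizes ])
      topTen = take 10 (sortBy (comparing (Down . snd)) counts)
  mapM_ (\(size, n) -> putStrLn (show n ++ " | " ++ show size)) topTen
```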
Hashes are used for many things beyond just hash tables. Pretty much every cryptographic protocol (including the https connection we are using right now) involves hashes.
Hash tables worry about collisions because h is so small: it’s the size of the table, typically a few thousand at most.
In most other uses, h is very big: 2^64 or 2^128 is common. In these cases, collisions in any reasonably sized set are virtually non-existent.
How non-existent? Take the 128-bit case: you would need to generate around 2^64 hashes (call it 20 quintillion of them) before the chance of even a single collision reached 50%. Crank out a billion hashes every second, around the clock, and it would still take you roughly 700 years to get there.
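(If you want to see where that 50% threshold comes from, the birthday bound is easy to compute; a sketch, with my naming again:)

```haskell
-- Sketch: roughly how many values before a collision becomes a coin flip?
-- Solving 1 - e^(-k^2 / (2*h)) = 0.5 for k gives k ≈ sqrt (2 * h * ln 2).
k50 :: Int -> Double
k50 bits = sqrt (2 * 2 ** fromIntegral bits * log 2)

-- k50 32  ≈ 7.7e4   -- ~77,000 values
-- k50 64  ≈ 5.1e9   -- ~5 billion values
-- k50 128 ≈ 2.2e19  -- ~22 quintillion values
```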
Given these numbers, it is quite common to use a large hash as a definitive determinant of identity. Every digitally signed document and certificate relies on this. Many digital asset systems (collections of images, 3D models, etc.) also treat hashes as equivalent to identity.
Sorry for the long explanation… in a prior time, I was a cryptographic researcher at Google, working on systems that used hashes extensively!
Meanwhile, I had been focusing on rehearsing for a gig that was just last Sunday, and now I can get back to elk-herd… which might have a bug or two to be squashed!