Description
I'm trying to implement my own map-like data structure. I'm using generics where I have a comparable type on hand, but I don't have a way to hash it. To work around that, I'm adding a Hash() uint64 constraint, but that requires every key type to have a custom Hash method, which is quite unfortunate, given that the Go runtime already knows how to hash comparable types.
I propose we add:

```go
// Comparable returns the hash of comparable value v with the given seed
// such that Comparable(s, v1) == Comparable(s, v2) if v1 == v2.
// If v contains a floating-point NaN, then the hash is non-deterministically random.
func Comparable[T comparable](seed Seed, v T) uint64
```
Under the hood, this function is a thin wrapper over mapType.hasher.

There are some caveats with the operation of this function, but they're no different than the caveats of using ==.
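For illustration, here is a minimal sketch of how the proposed function would be called, assuming it lands in hash/maphash with the signature above; the key type is just an example:

```go
package main

import (
	"fmt"
	"hash/maphash"
)

// key is an arbitrary comparable type with no custom Hash method.
type key struct {
	name string
	id   int
}

func main() {
	seed := maphash.MakeSeed()
	k1 := key{"x", 1}
	k2 := key{"x", 1}
	// Equal values must hash equally under the same seed.
	fmt.Println(maphash.Comparable(seed, k1) == maphash.Comparable(seed, k2)) // true
}
```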
Status: Accepted
dsnet commented on Aug 25, 2022
I'm realizing that a major caveat is that this can hash pointers. In order for Comparable(s, v1) == Comparable(s, v2) if v1 == v2 to hold, this probably implies that Go doesn't use a moving GC.

ianlancetaylor commented on Aug 25, 2022
Note that in #28322 we discussed and rejected the idea of having the package that became hash/maphash support general types.
randall77 commented on Aug 25, 2022
Even today, we have a moving GC if you count stacks. The built-in maps avoid this problem by never stack-allocating the objects that map pointer keys refer to. For maphash.Comparable we'd need to make sure that it escapes its second argument, so that anything it refers to will not be on the stack.

andy-wm-arthur commented on Dec 21, 2022
I wanted something like Comparable[T]() for a package I was working on. When I couldn't find a general-purpose solution, I wrote my own version that uses unsafe to get access to the runtime hash implementation: https://github.com/dolthub/maphash

My use case is a simple stripe-locked map, where a generic hash function is used to pick the stripe:
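A minimal sketch of such a stripe-locked map, written against the Comparable signature proposed above (the stripe count and locking scheme here are illustrative, not taken from the package):

```go
package stripemap

import (
	"hash/maphash"
	"sync"
)

const numStripes = 64

// Map is a stripe-locked map: a generic hash of the key picks a
// stripe, and each stripe has its own lock and inner map.
type Map[K comparable, V any] struct {
	seed    maphash.Seed
	stripes [numStripes]stripe[K, V]
}

type stripe[K comparable, V any] struct {
	mu sync.RWMutex
	m  map[K]V
}

func New[K comparable, V any]() *Map[K, V] {
	mp := &Map[K, V]{seed: maphash.MakeSeed()}
	for i := range mp.stripes {
		mp.stripes[i].m = make(map[K]V)
	}
	return mp
}

// stripeFor picks a stripe using the generic hash of the key.
func (mp *Map[K, V]) stripeFor(k K) *stripe[K, V] {
	return &mp.stripes[maphash.Comparable(mp.seed, k)%numStripes]
}

func (mp *Map[K, V]) Load(k K) (V, bool) {
	s := mp.stripeFor(k)
	s.mu.RLock()
	defer s.mu.RUnlock()
	v, ok := s.m[k]
	return v, ok
}

func (mp *Map[K, V]) Store(k K, v V) {
	s := mp.stripeFor(k)
	s.mu.Lock()
	defer s.mu.Unlock()
	s.m[k] = v
}
```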
Looking through the history of this issue, it seems like there are a number of interesting applications for this proposal. @bcmills listed a number of relevant examples in #21195
dsnet commented on Feb 13, 2023
In my application, I'm writing a generic data structure for packet processing, where I only care about hashing byte arrays.
As a half-step measure, I wonder if there's value in providing a Comparable function that panics if the type transitively contains any Go pointers, interfaces, or channels. This avoids the gnarly question of how pointers should be handled until a later point in time. There's always the possibility of providing it later, while providing value for many comparable types today.

The downside is that generic data structures (that expose comparable as part of their API) are not fully type-safe, since only a subset of comparable types are supported. It also functionally introduces a third definition of "comparability". It's already confusing enough that the Go specification distinguishes between "comparable" and "strictly comparable". And we'd be introducing "very strictly comparable" (or something) that's a subset of "strictly comparable".
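As an illustration of what "transitively contains pointers, interfaces, or channels" might mean in practice, here is a rough sketch of such a check using reflection; the helper name and the treatment of strings are assumptions for illustration, not part of the proposal:

```go
package hashcheck

import "reflect"

// containsPointers reports whether values of type t embed a raw Go
// pointer, interface, or channel anywhere in their representation.
// Strings are allowed here: they hold a data pointer internally, but
// == and hashing depend only on the bytes, not the address.
func containsPointers(t reflect.Type) bool {
	switch t.Kind() {
	case reflect.Pointer, reflect.UnsafePointer, reflect.Interface, reflect.Chan:
		return true
	case reflect.Array:
		return containsPointers(t.Elem())
	case reflect.Struct:
		for i := 0; i < t.NumField(); i++ {
			if containsPointers(t.Field(i).Type) {
				return true
			}
		}
		return false
	default:
		return false
	}
}
```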
dsnet commented on Feb 13, 2023
For posterity, here are the thorny issues with pointers, interfaces, and channels:
- Suppose we hash p1 as h1, let p1 be GC'd, and then allocate p2 as the same type with the exact same address. Should hashing p2 as h2 result in h1 == h2? Intuitively, the answer seems to be no, since these are strictly different objects: p1 was destroyed, and p2 was created in its place and just happened to use the same address. However, naively hashing the address will result in equal hashes.
- How the rtype in an interface is hashed. If we naively hash the pointer, then two Go types that are semantically identical, but alive at different times in the program's lifespan, may have different hashes, since the rtype (which semantically represents the same Go type) was allocated at different memory addresses.

aclements commented on Sep 12, 2023
I don't think it does. Broadly, there are two solutions to pointer hashing in a moving GC:
1. Let a pointer's hash change when the object moves, and fix up anything that depends on it.
2. Keep a pointer's hash stable across moves by recording the object's original hash (or address) somewhere it can be found later.
Today, option 1 would mean rebuilding all maps. This is possible, but would be extremely difficult to do in a low-latency way. If we were to expose hash values of pointers, this would become essentially untenable. (You could imagine that the hash values we expose are opaque, but that severely limits what you can do with them.)
Option 2 is, I believe, fairly reasonable. This is what Java does for the default hashCode implementation. The main downside is that, today, hashing a pointer value requires no indirection–we simply hash the pointer value–and this would require us to perform an object lookup to check for an original hash (not just an indirection because it could be an interior pointer). It might be possible on some architectures to use a bit in the pointer to indicate that the object has moved and we need to do this extra work. It would also require us to have somewhere to store the original hash that's easy to access from the pointer. But if we're moving objects, we'd probably have a different heap layout, too, and this would simply have to be a consideration.
I think maphash.Comparable would have to escape its argument in the current implementation. That's not ideal, but doesn't seem like a big deal to me.
mpx commented on Sep 16, 2023
I think there are a couple of additions to Go that make this highly worth reconsidering. In particular, comparable makes it possible to refer to types that support equality which are suitable for hashing (reinforcing the above).

The lack of maphash.Comparable (or similar) leaves developers and the ecosystem with two bad options: require a custom Hash method on every key type, or reach into the runtime hash via unsafe (e.g., relying on assume-no-moving-gc).

Providing maphash.Comparable would be extremely useful and remove this effective limitation of Type Parameters.

aclements commented on Mar 28, 2024
Edit 2024-06-05: Tried to clarify table.
We were discussing this in the Google compiler/runtime team and had some more thoughts.
Stepping back, the basic property of hashing is that x == y implies hash(x) == hash(y). Values of type *T that point to the stack cause trouble for this because currently the hash of a pointer is simply a function of the pointer, and stack pointers can change. That is, performing hash(&x) before and after a stack movement will result in different values.

I'm being very careful to say "values of type *T" because not all stack pointers are a problem. Notably, values with underlying type string structurally contain a pointer that may well point to the stack, but the value of this pointer does not affect the result of == or of the hash function. We know this is always true because all hash functions are constructed by the compiler.

Values with underlying type chan T behave like *T, so for brevity we won't discuss them explicitly below.

Values of interface type are more complicated. Their dynamic type may be *T and thus they could be a stack pointer. This means static analysis must conservatively assume they may be stack pointers. Dynamic solutions, on the other hand, can inspect the type and value and determine if it's a stack pointer that would affect the hash function.

It's informative to look at how built-in maps deal with all this today. Built-in maps are known to the compiler's escape analysis and follow very simple rules: m[k] lookups do not escape either m or k, delete(m, k) also does not escape m or k, but m[k] as an lvalue for an insert always escapes k. This works out because we know these operations are going to combine hashing with the map operation. Insert needs to escape the value regardless because it's about to put it in the map (this could perhaps be relaxed if the map itself were stack-allocated, but the compiler currently doesn't reason about this). Lookup/delete only need the hash temporarily, so it doesn't have to be stable for stack pointers. Furthermore, because insert always escapes its value, lookup/delete of a stack pointer cannot possibly succeed, so the value of the hash doesn't matter.
Options
Once we decouple hashing from data structure operations using that hash, things get more complicated. I see a few options:
1. We change how pointers to the stack are hashed. This is a purely dynamic solution.
2. We escape values of type *T and interfaces passed to maphash.Comparable, so the hash function is never affected by a stack pointer. Since this is a static solution, it has to treat interfaces conservatively, and we have two options for dealing with string values:
   i. We indiscriminately escape the argument to maphash.Comparable.
   ii. We make escape analysis distinguish strings passed to maphash.Comparable and not escape them.
3. We use dynamic escape analysis: when maphash.Comparable is called at run-time with a value of type *T that points to the stack, we move its target to the heap. We don't have this capability right now, but we're already exploring it for other reasons. This is a dynamic solution.
4. We somehow track the data flow of the result of maphash.Comparable to figure out if it's short-lived for a lookup or long-lived for an insert.

1. Change how pointers to the stack are hashed
The hash function for *T would normalize the pointer before hashing it, such that the normalized value would not change across stack moves. Probably the simplest implementation is to check if the pointer points within the stack bounds and subtract the pointer value from the top-of-stack pointer. This is nice because it's totally localized to the hash function for *T, and doesn't affect strings or interface values. However, it has some probably minor performance penalty and restricts some future design directions.

Today, composite types can sometimes be hashed directly with a single large memhash operation, and this would narrow the set of circumstances in which that optimization is possible. There are already many things that disallow this optimization, including string and interface fields, and even padding between fields, so in practice further narrowing it may not matter much. We could potentially use different hash functions for built-in maps and maphash to mitigate this, but that would make binaries bigger and disadvantage maphash. We can also make this decision on the fly: because of escape analysis, if the object we're hashing isn't itself on the stack, we know it can't contain stack pointers, so we can do more efficient hashing. This check is fast and only needs to be done once per hash operation.

The more concerning thing to me is that this may restrict some future design directions. For example, we're currently exploring dynamic escape analysis, where values can move from the stack to the heap at run-time. This would make the simple normalized hash function no longer stable. We could apply the same solutions as a moving GC, but that has a cost for hashing of all pointers. There may be other ways this would hamstring us.
2. Escape *T and interfaces passed to maphash
For a user-defined container that uses maphash, insert is going to escape the key, while lookup and delete can avoid this. This option has the following consequences:
[Table from the original comment: for each key kind (*T, interface, string with no special case, string with the special case), whether the key escapes on insert vs. on lookup/delete, compared with the built-in map.]
(w/ special case)With this approach, lookups in user-defined maps are slightly disadvantaged relative to the built-in map. For
*T
, this probably doesn't matter in practice. For interfaces it's unfortunate because interface values often contain pointers that don't affect the hash result. For example, if the interface's dynamic type isint
, the compiler has to box that value and store it as a pointer to anint
value. This solution would force these to escape.For string values (and strings embedded in other types), if we treat them naively, they'll always escape, but we can special-case them in escape analysis to close this particular gap.
3. Dynamically escape stack pointers passed to maphash
Dynamic escape analysis allows the compiler's escape analysis to be less conservative by enabling the runtime to move values allocated on the stack to the heap only when necessary. We don't currently have this ability, but we're exploring it as a general optimization when we have hot/cold path information from PGO and the compiler can determine that all escapes of a value happen on cold paths. We could apply it to maphash as well.
4. Track the data flow of hash values
I include this here for completeness, but I'm not sure how plausible it is. The idea is that, if we can bound the lifetime of a hash value, it doesn't need to escape the hashed value. This may be possible if we could represent hash values as a distinct type, but in general data structures need to use the numerical value of a hash, e.g., to select a hash bucket, and I think at that point it would be very easy to lose track of the data flow.
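To illustrate where the opaque-type idea leaks, here is a tiny sketch with a hypothetical distinct Hash type: bucket selection needs the numeric value, and once the uint64 is extracted the data flow becomes hard to track.

```go
package opaquehash

// Hash is a hypothetical opaque wrapper around a hash value.
type Hash struct{ v uint64 }

// bucketIndex shows why data structures still need the numeric value:
// picking a bucket requires arithmetic on the hash, which pulls the
// raw uint64 back out of the opaque type.
func bucketIndex(h Hash, nbuckets int) int {
	return int(h.v % uint64(nbuckets))
}
```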
rsc commented on May 15, 2024
The easiest implementation would be to accept that Comparable's arguments escape. We see how to do that very easily. My instinct is that it probably is still useful in that case, but maybe I'm wrong. Is it still useful if arguments escape?
If arguments can't escape then it will take a lot longer to move forward with this.
The issue was retitled from "proposal: hash/maphash: add Comparable to hash comparable values" to "proposal: hash/maphash: add func Comparable".