[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <Z27TkUSmM1sCTslO@lappy>
Date: Fri, 27 Dec 2024 11:19:29 -0500
From: Sasha Levin <sashal@...nel.org>
To: Linus Torvalds <torvalds@...ux-foundation.org>
Cc: apw@...onical.com, conor@...nel.org, corbet@....net,
dwaipayanray1@...il.com, geert+renesas@...der.be, gitster@...ox.com,
horms@...nel.org, joe@...ches.com, linux-doc@...r.kernel.org,
linux-kernel@...r.kernel.org, linux@...mhuis.info,
lukas.bulwahn@...il.com, miguel.ojeda.sandonis@...il.com,
niklas.soderlund@...igine.com, willy@...radead.org,
workflows@...r.kernel.org, kees@...nel.org
Subject: Re: [PATCH] git-disambiguate: disambiguate shorthand git ids
On Thu, Dec 26, 2024 at 05:50:22PM -0800, Linus Torvalds wrote:
>On Thu, 26 Dec 2024 at 16:55, Sasha Levin <sashal@...nel.org> wrote:
>>
>> What got me worried originally is the example Kees provided which
>> creates a collision of a 12-character abbreviated commit ID:
>
>Note that you can always create short ambiguous cases.
>
>And in fact, since the way you create them is to just put noise in the
>right place and brute-force things, you can also always make sure that
>the one-liner short commit name will match the target commit too.
>
>In other words: just accept the fact that a short hash is simply not a
>secure hash. That's very very fundamental. It's just inherent in the
>whole "we have shortened things to be legible".
>
>And once you just accept that it's fundamental to any short hash, you
>can relax and just live with it as just a fact of life.
>
>> So sure, we don't have a collision right now, but going from 0 to 1 is
>> going to be painful.
>
>I disagree. You need to just own up to how it is, and more importantly
>that you WILL NEVER EVER BE ABLE TO FIX IT.
>
>There is no way to disambiguate the active malicious case, because the
>active malicious case will just be able to handle any disambiguation
>you throw at it too.
>
>So the fix is to not *rely* on disambiguation, and to just accept it.
>Don't fight reality. You'll just lose.
>
>> Are we going to be actively watching for someone trying to sneak in a
>> commit like that, or should we just handle that case properly?
>
>See above. There is no "properly". There is only reality.
>
>Git internally uses the full hashes. And any reasonably usefully
>shortened hashes *cannot* be disambiguated in the face of an active
>malicious person.
>
>If you have some tool that you rely on absolutely to give you the "One
>Correct Answer", then you are already broken. If thats' the goal, then
>all you can do is give up and pray to some random $deity.
>
>Or, as my argument is, DON'T DO THAT THEN.
>
>Don't rely on some absolute thing. Accept the fact that a short hash
>will always possibly refer to multiple things, and that you will
>*always* need to have a human that actually *THINKS* about it, not
>mindless automation.
>
>The best the tools can do is say "there are multiple options here"
>(or, more commonly, "I found no options at all, but maybe it meant
>XYZ").
>
>Literally.
>
>To summarize: If you are aiming for anything else, you are bound to be
>disappointed.
The script in the original mail will handle two important cases for me:
1. The "wrong ID" case, where the the provided commit ID doesn't exist
and we need to look at the subject line to figure out what the actual ID
is.
2. The case where we either get a too-short commit ID that has a
collision, or we just got unlucky and finally ended up with a 12-char
collision, but the subject line is enough to automatically resolve to
the correct ID.
We're disagreeing over the remaining <0.0001% of cases.
>So aim for that simple "let me know about multiple choices", with the
>knowledge that they are going to be EXCEEDINGLY rare.
>
>And then if we have somebody who actively acts badly and works to make
>that "it's going to be EXCEEDINGLY rare" be much more common than it
>is supposed to be, we deal with that *person* by just not working with
>them any more.
>
>See? The solution is not some kind of "this cannot happen". The
>solution is "stop accepting shit from evil people".
Not necessarily "evil", but consider something similar to the UMN saga
where we get "researchers" trying to sneak in commits that create a
collision. Once these make it in, it will be difficult walking them back
out.
>Which, btw, is not a new thing. This is how open source works. It's
>actually way more work to create collisions in short hashes, when you
>can much more easily just send a bad patch that wastes peoples time
>without any hash collision at all - just by virtue of writing bad
>code.
No, but it makes for a "good" undergrad research paper...
>So thinking that the short hash is some kind of special problem is wrong.
>
>It's in fact a particularly *easy* problem to deal with, because at
>least the whole "somebody did something malicious" is something git
>will balk at, simply because the basic git tools will refuse to touch
>the ambiguous hash.
>
>So *your* tooling should concentrate on one thing, and one thing only:
>making it easier for a *THINKING*HUMAN* to decide what a bad hash
>means.
>
>But you need to do it with the up-front understanding that it requires
>that thinking human, and is not mindless automation.
>
>And as mentioned, the most common case of bad hashes BY FAR is the
>"oh, that doesn't match anything I know about at all", as opposed to
>"oh, that matches two or more different commits".
>
>> >and my point is that this is really not about "disambiguating short
>> >SHA1 IDs". Because those "ambiguous" SHA1's to a very close
>> >approximation simply DO NOT EXIST.
>> >
>> >But the completely wrong ones? They are plentiful.
>>
>> Is the concern mostly around the term "disambiguate"? Would
>> git-sanitize.sh work better?
>
>No, my main worry is that you talk about short hashes (and talked
>about shortening our existing ones even more).
>
>When I really think that the size isn't the main problem at all.
No, it's not a problem. In my mind, I figured we could use shorter
hashes in mail message to make it easier to communicate.
It doesn't have to be formal, but for example if we exchange mails about
an issue, and I end up referring to '1d1a ("arm64/sme: ...")' it both
makes the mail more readable, but still someone who doesn't have context
can still easily get to the commit I was referring to.
To me, it's a nice (minor) improvement.
>"disambiguate" is a fine word, but only if you use it in the context
>of "I have a bad hash". Not just a short one. Because the hash being
>too short is simply not the main case worth worrying about.
>
>And hey, just to really hit this same argument home: I can absolutely
>guarantee that you have a much more insidious problem of "wrong hash",
>namely the "oh, it's actually a valid commit hash, but the Fixes: line
>simply got the wrong commit:".
>
>Because that's a mistake I see regularly: somebody simply blames the
>wrong commit. I've most definitely done that very thing myself.
>Sometimes it's people reading the history wrong, and not realizing
>that the bug actually happened before the blamed change too. Or the
>bug ends up being a combination of factors, and people didn't realize
>that the commit they blamed was just one part of it.
>
>So my complaint really ends up being that anybody who trusts those
>hashes mindlessly is going to have mistakes, and the actual hash
>collision case MUST NOT be the primary worry.
>
>Because as long as that hash size is all you care about, you're
>barking up the wrong tree.
>
> Linus
--
Thanks,
Sasha
Powered by blists - more mailing lists