[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <Z237CwC_YKhoZubs@lappy>
Date: Thu, 26 Dec 2024 19:55:39 -0500
From: Sasha Levin <sashal@...nel.org>
To: Linus Torvalds <torvalds@...ux-foundation.org>
Cc: apw@...onical.com, conor@...nel.org, corbet@....net,
dwaipayanray1@...il.com, geert+renesas@...der.be, gitster@...ox.com,
horms@...nel.org, joe@...ches.com, linux-doc@...r.kernel.org,
linux-kernel@...r.kernel.org, linux@...mhuis.info,
lukas.bulwahn@...il.com, miguel.ojeda.sandonis@...il.com,
niklas.soderlund@...igine.com, willy@...radead.org,
workflows@...r.kernel.org, kees@...nel.org
Subject: Re: [PATCH] git-disambiguate: disambiguate shorthand git ids
On Thu, Dec 26, 2024 at 03:33:36PM -0800, Linus Torvalds wrote:
>On Thu, 26 Dec 2024 at 14:33, Sasha Levin <sashal@...nel.org> wrote:
>>
>> Which means that folks should be able to use a fairly short abbreviated
>> commit IDs in messages, specially for commits with a long subject line.
>
>So I don't think we should take this as a way to make *shorter* IDs,
>just as a way to accept historical short IDs.
>
>Also, I absolutely detest how you made this be all about "short IDs".
>
>As mentioned in my very original email on this matter, the actual REAL
>AND PRESENT issue isn't ambiguous IDs. We don't really have them.
What got me worried originally is the example Kees provided which
creates a collision of a 12-character abbreviated commit ID:
$ git log 1da177e4c3f4
error: short object ID 1da177e4c3 is ambiguous
hint: The candidates are:
hint: 1da177e4c3f41 commit 2005-04-16 - Linux-2.6.12-rc2
hint: 1da177e4c3f47 commit 2024-12-14 - docs: git SHA prefixes are for humans
When I tested it locally, my scripts were broken, our CVE scripts were
broken, and it is quite the PITA to address after the fact (think of all
the "Fixes: 1da177e4c3f4 [...]" lines we have in the log).
So sure, we don't have a collision right now, but going from 0 to 1 is
going to be painful.
Are we going to be actively watching for someone trying to sneak in a
commit like that, or should we just handle that case properly?
>What we *do* have is "wrong IDs". We have a ton of those.
>
>Look here, you can get a list of suspiciously wrong SHA1s with
>something like this:
>
> git log |
> egrep '[0-9a-f]{9,40} \(".*"\)' |
> sed 's/.*[^0-9a-f]\([0-9a-f]\{9,40\}\)[^0-9a-f].*/\1/' |
> sort -u > hexes
>
>which generates a list of things that look like commit IDs (ok,
>there's probably smarter ways) in our git logs.
>
>Now, *some* of those won't actually be commit IDs at all, they'll just
>be random hex numbers the above finds.
>
>But most of them will indeed be references to other commits.
>
>Then you try to find the bogus ones by doing something like
>
> cat hexes |
> while read i; do git show $i >& /dev/null || echo "No $i SHA"; done
>
>and you will get a lot ot hits.
>
>A *LOT*.
I ended up with this fun thing:
git log --pretty=format:'%B' | grep -E '^Fixes:[[:space:]][0-9a-fA-F]{8,40}' | while read -r line; do
[[ $line =~ ^Fixes:[[:space:]]([0-9a-fA-F]{8,40})(.*) ]] && \
! git rev-parse --quiet --verify "${BASH_REMATCH[1]}^{commit}" >/dev/null 2>&1 && \
{ r=$(./scripts/git-disambiguate.sh --force "${BASH_REMATCH[1]}" "${BASH_REMATCH[2]}"); \
[[ -n $r ]] && echo "Bad: $line" && echo "Good: Fixes: $(git ol "$r")"; }
Which looks for these wrong IDs in "Fixes:" tags and tries to run them
through git-disambiguate.sh:
Bad: Fixes: d71e629bed5b ("ARC: build: disallow invalid PAE40 + 4K page config")
Good: Fixes: 8871331b1769 ("ARC: build: disallow invalid PAE40 + 4K page config")
Bad: Fixes: 7e654ab7da03 ("cifs: during remount, make sure passwords are in sync")
Good: Fixes: 0f0e35790295 ("cifs: during remount, make sure passwords are in sync")
Bad: Fixes: ee650b3820f3 ("apparmor: properly handle cx/px lookup failure for complain")
Good: Fixes: db93ca15e5ae ("apparmor: properly handle cx/px lookup failure for complain")
[...]
>Look, I didn't check very many of them. Mainly because it gets *so*
>many hits, and I get bored very easily.
>
>But I did check a handful, just to get a feel for things.
>
>And yes, some of them were random hex numbers unrelated to actual git
>IDs, but most were really supposed to be git IDs. Except they weren't
>- or at least not from the mainline tree.
>
>For example, look at commit daa07cbc9ae3 ("KVM: x86: fix L1TF's MMIO
>GFN calculation") which references one of those nonexistent commit
>IDs:
>
> Fixes: d9b47449c1a1 ("kvm: x86: Set highest physical address bits
>in non-present/reserved SPTEs")
>
>and I have no idea where that bogus commit ID comes from. Maybe it's a
>rebase. Maybe it's from a stable tree. But it sure doesn't exist in
>mainline.
This one is indeed in the 4.18 stable tree.
>What *does* exist is commit 28a1f3ac1d0c ("kvm: x86: Set highest
>physical address bits in non-present/reserved SPTEs"), which I found
>by just doing that
>
> git log --grep='kvm: x86: Set highest physical address bits in
>non-present/reserved SPTEs'
>
>and my point is that this is really not about "disambiguating short
>SHA1 IDs". Because those "ambiguous" SHA1's to a very close
>approximation simply DO NOT EXIST.
>
>But the completely wrong ones? They are plentiful.
Is the concern mostly around the term "disambiguate"? Would
git-sanitize.sh work better?
--
Thanks,
Sasha
Powered by blists - more mailing lists