[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID:
<MW5PR13MB5632440E0CA2EABD11B79F3CFDB42@MW5PR13MB5632.namprd13.prod.outlook.com>
Date: Wed, 9 Apr 2025 17:44:51 +0000
From: "Bird, Tim" <Tim.Bird@...y.com>
To: Thomas Gleixner <tglx@...utronix.de>, Gon Solo <gonsolo@...il.com>,
Duje Mihanović <duje.mihanovic@...le.hr>
CC: Greg Kroah-Hartman <gregkh@...uxfoundation.org>,
Ricardo Ribalda
<ribalda@...omium.org>,
"linux-spdx@...r.kernel.org"
<linux-spdx@...r.kernel.org>,
"linux-kernel@...r.kernel.org"
<linux-kernel@...r.kernel.org>
Subject: RE: spdxcheck: python git module considered harmful (was RE: [PATCH]
scripts/spdxcheck: Limit the scope of git.Repo)
> -----Original Message-----
> From: Thomas Gleixner <tglx@...utronix.de>
> On Tue, Apr 08 2025 at 17:34, Tim Bird wrote:
> >> -----Original Message-----
> > For what it's worth, I've always been a bit skeptical of the use of the python git module
> > in spdxcheck.py. Its use makes it impossible to use spdxcheck on a kernel source tree
> > from a tarball (ie, on source not inside a git repo). Also, from what I can see in spdxcheck.py,
> > the way it's used is just to get the top directories for either the LICENSES dir,
> > the top dir of the kernel source tree, or the directory to scan passed on the
> > spdxcheck.py command line, and then to use the repo.traverse() function on said directory.
> >
> > This ends up excluding any files in the source directory tree that are not checked
> > into git yet, silently skipping them (which I've run into before when
> > using the tool).
>
> The exactly same problem exists the other way round. Run an
> unconstrained version of spdxcheck on a dirty source tree with lots of
> leftovers, then it scans nonsense all the way instead of skipping some
> not yet git tracked files.
Yeah. I thought about this overnight, and came to the same conclusion.
I forgot that most people build the kernel in a way
that the build results end up in the source tree. (Crazy, right?)
I almost always use KBUILD_OUTPUT, and I always use it when I'm doing
embedded and spdx-related work, so I don't often run into build
contamination of the source tree.
>
> The easiest way for me to achieve that was using git to exclude all of
> the irrelevant noise, which I still consider to be a reasonable design
> decision.
Agreed. Given common build practices, this is a reasonable design decision.
I thought there might be a good reason for this design choice, and
was hoping you would respond, Thomas. Thanks for the quick feedback.
>
> And yes, it ignores not yet tracked files, but if you want to check
> them, then it's easy enough to commit them temporarily or provide a
> dedicated file target to the tools, which ignores git.
OK. Yes. That's an easy workaround.
>
> > I think the code could be relatively easily refactored to eliminate the use of the git
> > module, to overcome these issues. I'm not sure if removing the module would
> > eliminate the yield operation (used inside repo.traverse()), which seems to be causing the
> > problem found here. IMHO, in my experience when using python it is helpful
> > to use as few non-core modules as possible, because they tend to break like this
> > occasionally.
> >
> > Let me know if anyone objects to me working up a refactoring of spdxcheck.py
> > eliminating the use of the python 'git' module, and submitting it for review.
>
> I have no objections at all as long as it gives the same result of not
> trying to scan random artifacts which might sit around in a source tree.
>
> But not for the price that I have to create a tarball or a pristine
> checked out tree first to run it. That'd be a usability regression to
> begin with.
Agreed.
>
> Good luck for coming up with a clever and clean solution for that!
I thought about various solutions for this, but each one I came up
with had other drawbacks. If it was just a matter of separating
*.[chS] files from ELF object files, that would be easy to deal with.
But we put SPDX headers on all kinds of files, and there are lots
of other types of files generated during a build that are not just
ELF objects. And build rules change over time. So even if I made
a comprehensive system today to catch build-generated outliers,
the solution would probably need constant updating and tweaking, which
IMHO makes it a no-go.
>
> Just for the record: I rather wish that people would contribute to
> eliminate the remaining 17% (15397 files) which do not have SPDX
> identifiers than complaining about the trivial to solve short-comings of
> the tool, which was written to help this effort and to make sure that it
> does not degrade.
I agree with this. Analyzing where the headers are missing is interesting.
But it's more important to just fix the missing ones.
I'll spend more of my time working on missing headers,
rather than on tools to analyze and report them.
Thanks and regards,
-- Tim
Powered by blists - more mailing lists