linux-kernel - Small student project idea on appropriate integration trees in MAINTAINERS

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite for Android: free password hash cracker in your pocket

[<prev] [next>] [thread-next>] [day] [month] [year] [list]

Message-ID: <CAKXUXMyRAer=0S9pxiRs2iF3pdkU8zW=JZw2a+nJJ30iPLPhCA@mail.gmail.com>
Date:   Fri, 22 Jan 2021 09:22:24 +0100
From:   Lukas Bulwahn <lukas.bulwahn@...il.com>
To:     devel@...ts.elisa.tech
Cc:     Ralf Ramsauer <ralf.ramsauer@...-regensburg.de>,
        Wolfgang Mauerer <wolfgang.mauerer@...-regensburg.de>,
        Jonathan Corbet <corbet@....net>,
        Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
        Pia Eichinger <pia.eichinger@...oth-regensburg.de>,
        Başak Erdamar <basakerdamar@...il.com>
Subject: Small student project idea on appropriate integration trees in MAINTAINERS

Dear all,

here is a small student project idea:

In previous work on MAINTAINERS and process conformance, Pia Eichinger
[1] has investigated: are patches integrated by the maintainers
defined by the responsibilities in MAINTAINERS?

In this project, we are interested in a related (possibly simpler)
question: Are the commits integrated into the appropriate integration
trees referenced in MAINTAINERS?

As I believe, a main difference between considering maintainers and
integration trees is that the information in MAINTAINERS about
integration trees is more erroneous, as it is not used as prominently
as the personal maintainer information, name and email, with the
wide-spread use of ./scripts/get_maintainer.pl. So, correcting those
errors on integration trees in MAINTAINERS is more dominant (but also
simpler) compared to correcting errors on personal maintainer
information in MAINTAINERS.

The answer on the question above can then ultimately be used to
identify which integration tree entries should be added to specific
sections in MAINTAINERS to match best against the actual integration
observed in git.

The factors and metric to determine what is best is of course the
challenging task of identifying a suitable heuristics that is:
  1. good enough to be used to create a change to MAINTAINERS that is
accepted by the community, and
  2. simple enough to be implemented with reasonable effort.

Background:

The MAINTAINERS section includes references, through the T: entries,
to the location of a source configuration management (SCM) tree with
its type, e.g., git, quilt, hg,
For each commit, the kernel git history carries the commit's
integration tree path, i.e., the information through with source
configuration management (SCM) trees a commit was integrated until it
was finally integrated into Linus Torvalds' tree.

Ideally the references in the MAINTAINERS sections are:
  - complete, i.e, all integration trees used for recent kernel
releases are mentioned in MAINTAINERS.
  - sound, i.e., the majority of the commits are integrated through
the trees referenced in the MAINTAINERS sections a patch belongs to.
  - precise, i.e., for each MAINTAINERS section, the majority of the
commits that belong to a  section are integrated through the tree
referenced in that section.

Goal:

We identify and measure to these properties above, completeness,
soundness and precision.

Then, we use that information to determine which integration tree
entries should be added to which specific sections to maximally
increase the three properties.

To evaluate the adequacy of this method, we can obtain feedback from
the responsible kernel maintainers through proposing patches modifying
the MAINTAINERS file, for the additions that we identified as most
relevant (maximally increasing the properties, to a reasonable
threshold of number of patch proposals [to not swamp maintainers
initially] and a threshold on relevance [to not send out minor changes
that are largely irrelevant to the community]).

In this project, we can make use of:

- gitdm [git://git.lwn.net/gitdm.git]: gitdm includes some scripts to
parse MAINTAINERS and obtain the integration tree patch of a commit.

and/or

- pasta [https://github.com/lfd/PaStA]: Similarly to gitdm, pasta
provides functionality to parse MAINTAINERS and some functionalities
on extracting information on commits.

Potential project phases:

- In the first phase (PoC phase), we could probably just create a
setup that combines or extends the functionality in gitdm and/or in
pasta.

- In the second phase (MAINTAINERS patch creation phase), we send out
some patches and collect feedback from maintainers.

- In a third phase, with a better understanding of the individual
pieces in gitdm and/or in pasta, we could then create a cleaner design
that also refactors gitdm and pasta to share the same implementation
when essentially the same basic functionality is used within the
various analyses.

References:

[1] https://lists.elisa.tech/g/devel/message/1269

---

Any thoughts on this small student project?

If it is not too crazy, I will mentor a student on this project
through one of the next mentoring programs (Google Summer of Code, LF
mentorship, etc.).

Lukas