[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20250225083814.51975742@foz.lan>
Date: Tue, 25 Feb 2025 08:38:14 +0100
From: Mauro Carvalho Chehab <mchehab+huawei@...nel.org>
To: Jonathan Corbet <corbet@....net>
Cc: Linux Doc Mailing List <linux-doc@...r.kernel.org>,
linux-hardening@...r.kernel.org, linux-kernel@...r.kernel.org
Subject: Re: [PATCH v2 09/39] scripts/kernel-doc.py: add a Python parser
Em Mon, 24 Feb 2025 16:38:58 -0700
Jonathan Corbet <corbet@....net> escreveu:
> Mauro Carvalho Chehab <mchehab+huawei@...nel.org> writes:
>
> > Maintaining kernel-doc has been a challenge, as there aren't many
> > perl developers among maintainers. Also, the logic there is too
> > complex. Having lots of global variables and using pure functions
> > doesn't help.
> >
> > Rewrite the script in Python, placing most global variables
> > inside classes. This should help maintaining the script in long
> > term.
>
> [...]
>
> > diff --git a/scripts/kernel-doc.py b/scripts/kernel-doc.py
> > new file mode 100755
> > index 000000000000..5cf5ed63f215
> > --- /dev/null
> > +++ b/scripts/kernel-doc.py
> > @@ -0,0 +1,2757 @@
> > +#!/usr/bin/env python3
> > +# pylint: disable=R0902,R0903,R0904,R0911,R0912,R0913,R0914,R0915,R0917,R1702
> > +# pylint: disable=C0302,C0103,C0301
> > +# pylint: disable=C0116,C0115,W0511,W0613
> > +# Copyright(c) 2025: Mauro Carvalho Chehab <mchehab@...nel.org>.
> > +# SPDX-License-Identifier: GPL-2.0
>
> The SPDX tag is supposed to be up top, right under the shebang
I'll move it.
>
> I also think you should give consideration to preserving the other
> copyright notices in the Perl version. A language translation doesn't
> remove existing copyrights...who knows how much creativity went into
> some of those regexes?
Makes sense, but the copyrights at kernel-doc.pl:
## Copyright (c) 1998 Michael Zucchi, All Rights Reserved ##
## Copyright (C) 2000, 1 Tim Waugh <twaugh@...hat.com> ##
## Copyright (C) 2001 Simon Huggins ##
## Copyright (C) 2005-2012 Randy Dunlap ##
## Copyright (C) 2012 Dan Luedtke ##
## ##
## #define enhancements by Armin Kuster <akuster@...sta.com> ##
## Copyright (c) 2000 MontaVista Software, Inc. ##
#
# Copyright (C) 2022 Tomasz Warniełło (POD)
Also doesn't preserve all copyrights from people that worked hard to
maintain it all over those years.
A quick check with git log shows 68 different authors touching it
(didn't check how trivial/hard were the changes):
$ git log --follow --pretty="%an" scripts/kernel-doc.pl|sort|uniq -c|sort -n
1 Alexander A. Klimov
1 Alexander Lobakin
1 Anna-Maria Behnsen
1 Bart Van Assche
1 Chen-Yu Tsai
1 Coco Li
1 Dan Luedtke
1 Donald Hunter
1 Gabriel Krisman Bertazi
1 Greg Kroah-Hartman
1 Harvey Harrison
1 Horia Geanta
1 Jason Gunthorpe
1 Jérémy Bobbio
1 Johannes Weiner
1 Jonathan Cameron
1 Kamil Rytarowski
1 Laurent Pinchart
1 Levin, Alexander (Sasha Levin)
1 Linus Torvalds
1 Lucas De Marchi
1 Mark Rutland
1 Masahiro Yamada
1 Michal Wajdeczko
1 Niklas Söderlund
1 Nishanth Menon
1 Peter Maydell
1 Pierre-Louis Bossart
1 Randy.Dunlap
1 Richard Kennedy
1 Rich Walker
1 Rolf Eike Beer
1 Silvio Fricke
1 Utkarsh Tripathi
1 valdis.kletnieks@...edu
1 Will Deacon
2 Ilya Dryomov
2 Jakub Kicinski
2 Jason Baron
2 Jonathan Neuschäfer
2 Markus Heiser
2 Pavan Kumar Linga
2 Pavel Pisa
2 Sakari Ailus
2 Yacine Belkadi
2 Yujie Liu
3 Akira Yokosawa
3 André Almeida
3 Andy Shevchenko
3 Ben Hutchings
3 Borislav Petkov
3 Conchúr Navid
3 Daniel Santos
3 Danilo Cesar Lemes de Paula
3 Mike Rapoport
4 Daniel Vetter
4 Matthew Wilcox
5 Martin Waitz
5 Paolo Bonzini
6 Aditya Srivastava
6 Vegard Nossum
7 Kees Cook
11 Johannes Berg
11 Tomasz Warniełło
20 Jonathan Corbet
32 Jani Nikula
57 Mauro Carvalho Chehab
65 Randy Dunlap
So, picking the latest e-mails from the above authors, maybe we can add
some credit lines like these:
# Converted from the kernel-doc script originally written in Perl
# under GPLv2, copyrighted since 1998 by the following authors:
#
# Aditya Srivastava <yashsri421@...il.com>
# Akira Yokosawa <akiyks@...il.com>
# Alexander A. Klimov <grandmaster@...klimov.de>
# Alexander Lobakin <aleksander.lobakin@...el.com>
# André Almeida <andrealmeid@...lia.com>
# Andy Shevchenko <andriy.shevchenko@...ux.intel.com>
# Anna-Maria Behnsen <anna-maria@...utronix.de>
# Armin Kuster <akuster@...sta.com>
# Bart Van Assche <bart.vanassche@...disk.com>
# Ben Hutchings <ben@...adent.org.uk>
# Borislav Petkov <bbpetkov@...oo.de>
# Chen-Yu Tsai <wenst@...omium.org>
# Coco Li <lixiaoyan@...gle.com>
# Conchúr Navid <conchur@....de>
# Daniel Santos <daniel.santos@...ox.com>
# Danilo Cesar Lemes de Paula <danilo.cesar@...labora.co.uk>
# Dan Luedtke <mail@...rl.de>
# Donald Hunter <donald.hunter@...il.com>
# Gabriel Krisman Bertazi <krisman@...labora.co.uk>
# Greg Kroah-Hartman <gregkh@...uxfoundation.org>
# Harvey Harrison <harvey.harrison@...il.com>
# Horia Geanta <horia.geanta@...escale.com>
# Ilya Dryomov <idryomov@...il.com>
# Jakub Kicinski <kuba@...nel.org>
# Jani Nikula <jani.nikula@...el.com>
# Jason Baron <jbaron@...hat.com>
# Jason Gunthorpe <jgg@...dia.com>
# Jérémy Bobbio <lunar@...ian.org>
# Johannes Berg <johannes.berg@...el.com>
# Johannes Weiner <hannes@...xchg.org>
# Jonathan Cameron <Jonathan.Cameron@...wei.com>
# Jonathan Corbet <corbet@....net>
# Jonathan Neuschäfer <j.neuschaefer@....net>
# Kamil Rytarowski <n54@....com>
# Kees Cook <kees@...nel.org>
# Laurent Pinchart <laurent.pinchart@...asonboard.com>
# Levin, Alexander (Sasha Levin) <alexander.levin@...izon.com>
# Linus Torvalds <torvalds@...ux-foundation.org>
# Lucas De Marchi <lucas.demarchi@...fusion.mobi>
# Mark Rutland <mark.rutland@....com>
# Markus Heiser <markus.heiser@...marit.de>
# Martin Waitz <tali@...ingilde.org>
# Masahiro Yamada <masahiroy@...nel.org>
# Matthew Wilcox <willy@...radead.org>
# Mauro Carvalho Chehab <mchehab+huawei@...nel.org>
# Michal Wajdeczko <michal.wajdeczko@...el.com>
# Michael Zucchi
# Mike Rapoport <rppt@...ux.ibm.com>
# Niklas Söderlund <niklas.soderlund@...igine.com>
# Nishanth Menon <nm@...com>
# Paolo Bonzini <pbonzini@...hat.com>
# Pavan Kumar Linga <pavan.kumar.linga@...el.com>
# Pavel Pisa <pisa@....felk.cvut.cz>
# Peter Maydell <peter.maydell@...aro.org>
# Pierre-Louis Bossart <pierre-louis.bossart@...ux.intel.com>
# Randy Dunlap <rdunlap@...radead.org>
# Richard Kennedy <richard@....demon.co.uk>
# Rich Walker <rw@...dow.org.uk>
# Rolf Eike Beer <eike-kernel@...tec.de>
# Sakari Ailus <sakari.ailus@...ux.intel.com>
# Silvio Fricke <silvio.fricke@...il.com>
# Simon Huggins
# Tim Waugh <twaugh@...hat.com>
# Tomasz Warniełło <tomasz.warniello@...il.com>
# Utkarsh Tripathi <utripathi2002@...il.com>
# valdis.kletnieks@...edu <valdis.kletnieks@...edu>
# Vegard Nossum <vegard.nossum@...cle.com>
# Will Deacon <will.deacon@....com>
# Yacine Belkadi <yacine.belkadi.1@...il.com>
# Yujie Liu <yujie.liu@...el.com>
Note: unfortunately, two of the original authors didn't send any patches
with their names since when we migrated to git. So, I'm placing them
without e-mails.
On a side note, I would keep the above credits line only at the main
kernel-doc.py, as it would be really hard to distribute it along
the several kernel-doc classes.
>
> > +# TODO: implement warning filtering
> > +
> > +"""
> > +kernel_doc
> > +==========
> > +
> > +Print formatted kernel documentation to stdout
> > +
> > +Read C language source or header FILEs, extract embedded
> > +documentation comments, and print formatted documentation
> > +to standard output.
> > +
> > +The documentation comments are identified by the "/**"
> > +opening comment mark.
> > +
> > +See Documentation/doc-guide/kernel-doc.rst for the
> > +documentation comment syntax.
> > +"""
> > +
> > +import argparse
> > +import logging
> > +import os
> > +import re
> > +import sys
> > +
> > +from datetime import datetime
> > +from pprint import pformat
> > +
> > +from dateutil import tz
> > +
> > +# Local cache for regular expressions
> > +re_cache = {}
> > +
> > +
> > +class Re:
>
> So I have to say this bugs me a bit ... the class is fine, but the
> one-letter case-only difference from the standard "re" class is just
> going to make the code harder for others to approach. "kern_re" or
> something like that? Or even "kre" if you really want it to be as short
> as possible.
A short name close to "re" made easier to convert the script ;-)
I'll rename the class to KernRe (*)
(*) I opted to follow pylint convention for class names, which
is using camel case. As we don't have classes in C, this
doesn't strictly conflicts with our Kernel conventions ;-)
To make easier, I'll do such change as a separate patch at the end
of the series.
>
> > + """
> > + Helper class to simplify regex declaration and usage,
> > +
> > + It calls re.compile for a given pattern. It also allows adding
> > + regular expressions and define sub at class init time.
> > +
> > + Regular expressions can be cached via an argument, helping to speedup
> > + searches.
> > + """
>
> [...]
>
> > +
> > +class KernelDoc:
> > + # Parser states
> > + STATE_NORMAL = 0 # normal code
> > + STATE_NAME = 1 # looking for function name
> > + STATE_BODY_MAYBE = 2 # body - or maybe more description
> > + STATE_BODY = 3 # the body of the comment
> > + STATE_BODY_WITH_BLANK_LINE = 4 # the body which has a blank line
> > + STATE_PROTO = 5 # scanning prototype
> > + STATE_DOCBLOCK = 6 # documentation block
> > + STATE_INLINE = 7 # gathering doc outside main block
> > +
> > + st_name = [
> > + "NORMAL",
> > + "NAME",
> > + "BODY_MAYBE",
> > + "BODY",
> > + "BODY_WITH_BLANK_LINE",
> > + "PROTO",
> > + "DOCBLOCK",
> > + "INLINE",
> > + ]
>
> So these ... kind of look like enums?
Yes. I considered using Python enum there:
https://docs.python.org/3/library/enum.html
but it would simply be:
from enum import Enum
class state(enum):
NORMAL = 0 # normal code
NAME = 1 # looking for function name
BODY_MAYBE = 2 # body - or maybe more description
BODY = 3 # the body of the comment
BODY_WITH_BLANK_LINE = 4 # the body which has a blank line
PROTO = 5 # scanning prototype
DOCBLOCK = 6 # documentation block
INLINE = 7 # gathering doc outside main block
and we would replace:
self.STATE_NORMAL
...
to:
state.NORMAL
(and a similar change for inline states as well)
Except for an ancillary number-to-string conversion with this syntax:
Color = Enum('Color', [('RED', 1), ('GREEN', 2), ('BLUE', 3)])
I didn't see much advantage of using it. See, we only use the string
on a single line, when --debug is used to show the per-line state machine.
Yet, perhaps we can split the states on a separate class anyway.
> That's kind of it for nits ... I do have one wish that will kind of hard
> to grant overall ... for the long-term maintenance of this code, it
> would be really nice if every non-trivial regex were described by a
> comment explaining what it is trying to do. It's not reasonable to
> expect that as a condition for accepting this rewrite, but it sure would
> be a nice goal to be working toward.
Agreed. For the future, the best would be to request a description for
every new regex added.
For the future, my plan is to split the C source parser (e.g. the code that
calls output_declaration() and/or modifies self.entry at the KernelDoc
class inside kdoc_parser to a separate file and class. We can then work
to implement some regexes to a more lexical way. For instance, all those
regexes:
(Re(r'\bstruct_group\s*\(([^,]*,)', re.S), r'STRUCT_GROUP('),
(Re(r'\bstruct_group_attr\s*\(([^,]*,){2}', re.S), r'STRUCT_GROUP('),
(Re(r'\bstruct_group_tagged\s*\(([^,]*),([^,]*),', re.S), r'struct \1 \2; STRUCT_GROUP('),
(Re(r'\b__struct_group\s*\(([^,]*,){3}', re.S), r'STRUCT_GROUP('),
(Re(r'__ETHTOOL_DECLARE_LINK_MODE_MASK\s*\(([^\)]+)\)', re.S), r'DECLARE_BITMAP(\1, __ETHTOOL_LINK_MODE_MASK_NBITS)'),
(Re(r'DECLARE_PHY_INTERFACE_MASK\s*\(([^\)]+)\)', re.S), r'DECLARE_BITMAP(\1, PHY_INTERFACE_MODE_MAX)'),
(Re(r'DECLARE_BITMAP\s*\(' + args_pattern + r',\s*' + args_pattern + r'\)', re.S), r'unsigned long \1[BITS_TO_LONGS(\2)]'),
(Re(r'DECLARE_HASHTABLE\s*\(' + args_pattern + r',\s*' + args_pattern + r'\)', re.S), r'unsigned long \1[1 << ((\2) - 1)]'),
(Re(r'DECLARE_KFIFO\s*\(' + args_pattern + r',\s*' + args_pattern + r',\s*' + args_pattern + r'\)', re.S), r'\2 *\1'),
(Re(r'DECLARE_KFIFO_PTR\s*\(' + args_pattern + r',\s*' + args_pattern + r'\)', re.S), r'\2 *\1'),
(Re(r'(?:__)?DECLARE_FLEX_ARRAY\s*\(' + args_pattern + r',\s*' + args_pattern + r'\)', re.S), r'\1 \2[]'),
(Re(r'DEFINE_DMA_UNMAP_ADDR\s*\(' + args_pattern + r'\)', re.S), r'dma_addr_t \1'),
(Re(r'DEFINE_DMA_UNMAP_LEN\s*\(' + args_pattern + r'\)', re.S), r'__u32 \1'),
could be converted on a more lexical representation where the macro
name with their arguments could be represented on a clearer expression.
Something like:
ReplaceMatch("DECLARE_HASHTABLE($1, $2)", "unsigned long $1[1 << (($2) - 1)]")
We have already the NestedMatch class that do something similar to
that(*) could be improved to do things like that - or we can come up with
something new.
(*) NestedMatch currently miss one feature: it doesn't split the contents
inside parenthesis on multiple match groups. I guess it won't be hard to
add such features to it.
Thanks,
Mauro
Powered by blists - more mailing lists