linux-kernel - Re: [RFC PATCH 0/2] docs: automarkup.py: Add literal markup of known constants

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20210609094535.7ed61991@coco.lan>
Date:   Wed, 9 Jun 2021 09:45:35 +0200
From:   Mauro Carvalho Chehab <mchehab+huawei@...nel.org>
To:     "Nícolas F. R. A. Prado" <n@...aprado.net>
Cc:     linux-doc@...r.kernel.org, Jonathan Corbet <corbet@....net>,
        Matthew Wilcox <willy@...radead.org>,
        linux-kernel@...r.kernel.org,
        André Almeida <andrealmeid@...labora.com>,
        ~lkcamp/patches@...ts.sr.ht
Subject: Re: [RFC PATCH 0/2] docs: automarkup.py: Add literal markup of
 known constants

Hi Nícolas,

Em Tue,  8 Jun 2021 22:43:06 -0300
Nícolas F. R. A. Prado <n@...aprado.net> escreveu:

> Hi,
> 
> a while back Matthew suggested adding automatic markup of known constants, like
> NULL and error codes, as literals [1]. This is a great idea since those literals
> are very frequently used throughout the documentation and are fixed names, so we
> can easily match them without false positives and make the source files less
> cluttered.
> 
> Patch 1 adds that feature to automarkup.py, while patch 2 removes the no longer
> needed markup from the xarray.rst file which was given as an example.
> 
> Some things I'd like to discuss:
> 
> * As a first draft I added the constants I found on xarray.rst plus all error
>   codes from errno-base.h and errno.h. We don't need to start with everything,
>   but we can certainly do better than this. What are common constants that
>   should be added here? (Matthew mentioned ANSI C or POSIX, some reference to
>   the constants in those would be great, even more if they are easily parsable)
> 
> * The Literals list added is already a bit big with just the error codes, so
>   perhaps we should move them to a separate plain text file that is read on
>   startup?

If we're willing to keep a long list, they should be parsed from files,
instead of directly included. For things like errno-base.h/errno.h, the
better would be if automarkup would read them directly from the header
file, as otherwise there will be a maintainance burden.

Yet, a regex like "-E[A-Z\d]+\b" would likely get all cases, but this
will produce false positives, like EXAMPLE, EXAMPLES and some other
non-trivial exceptions.

With regards to NULL, I would just use a trivial regex like: "\bNULL\b"
(but see below my notes about using "\b").

> * There was also mention of automatically converting uppercase words to
>   literals. I'm not so sure about that one so I left it out for now.
> 
>   The example given was XA_MARK_0, which is a #define in xarray.h. The way
>   to go here is certainly to first use kernel-doc to get it included in the
>   documentation, and then cross-reference to it. FWICT at the moment kernel-doc
>   for defines should be done like for functions (but the parenthesis can be
>   omitted, although they show up in the output):
> 
> 	/**
> 	  * XA_MARK_0 - Brief description.
> 	  *
> 	  * Description of the type.
> 	  */
> 
>   At the current state, the cross-reference would then need to be done either
>   through plain :c:macro:`XA_MARK_0`, or, by relying on automarkup, typedef
>   XA_MARK_0 (which is not ideal since this isn't a typedef). Options for
>   improvements would be to add 'macro' as a possible prefix (so eg. macro
>   XA_MARK_0), or go all out and try to cross-reference on every uppercase word
>   like suggested. We should strive for what is the most natural to write, as
>   long as the regex isn't too broad.
> 
>   Since automarkup.py is opt-out rather than opt-in, I think we should be extra
>   careful not to make the regexes too broad, even if in this case it relies on a
>   C symbol being present.
> 
>   One other idea I had was that we could limit to all uppercase words as
>   long as it has an _ on it, since I don't expect we would get false positives
>   with that. The downside is that it might be very confusing to people why their
>   MACRONAME isn't being automatically cross-referenced to...

What it can be done would be to first check/apply cross-references. Only if it
fails, then replace everything in uppercase to literals.

There is a drawback, tough: this would cause texts in uppercase. We used to
have lots of them before ReST conversion. I won't doubt that some files
would have kept a few of them. So, in this case, the regex would need to
require at least one _, e. g. something like: 

	\b[\-]?[A-Z_][A-Z\d_]*_[A-Z\d_]*\b

Again, tests are needed in order to check if the regex won't get anything
that shouldn't be caught. So, I would grep the source codes in order to
check if the regexes won't bring false positives, e. g.:

	$ git grep -E "\b[\-]?[A-Z_][A-Z\d_]*_[A-Z\d_]*\b" Documentation/


Btw, the "\b" at the end won't actually work, due to things like:

	GLOBAL_GEN_STORAGE{0:3}
	GAUDI_ENGINE_ID_*

If you take a look at the scripts/get_abi.pl that I wrote, I faced
the same problem there with \b. So, I ended implementing my own set
of \b:
	 
	my $start = qr {(^|\s|\() }x;
	my $bondary = qr { ([,.:;\)\s]|\z) }x;

Using the above replacements for \b, I would do something like this to
double-check if the regex won't be producing false-positives:

	$ for i in $(find Documentation/ -name '*.rst'); do perl -ne 'print "$1  \e[0;31m$2\e[0;37m  $3\n" if /(^|\s|\()([\-]?[A-Z_][A-Z\d_]*_[A-Z\d_]*)([,.:;\)\s]|\z)/' $i; done

Btw, on a quick look, this specific regex:

	(^|\s|\()([\-]?[A-Z_][A-Z\d_]*_[A-Z\d_]*)([,.:;\)\s]|\z)

Seems to be producing just one false positive: a sequence of _, e. g.:

	\b\_+\b

While we could make the regex more complex, I would just test if the second
match is '^\_+$', skipping it if found. On other words, a python-equivalent
to this:

	$ for i in $(find Documentation/ -name '*.rst'); do perl -ne 'print "$2\n" if /(^|\s|\()([\-]?[A-Z_][A-Z\d_]*_[A-Z\d_]*)([,.:;\)\s]|\z)/' $i|grep -vE "^\_+$"; done|sort|uniq

-

That's said, it is worth to also mention the false negatives:

	$ for i in $(find Documentation/ -name '*.rst'); do perl -ne 'print "$2\n" if /(^|\s|\()([\-]?[A-Z_][A-Z\d_]*_[A-Z\d_]*)([,.:;\)\s]|\z)/' $i|grep -vE "^\_+$"; done|sort|uniq >OK_list
	$ for i in $(find Documentation/ -name '*.rst'); do perl -ne 'print "$2\n" if /(^|\s|\()([\-]?[A-Z_][A-Z\d_]*)([,.:;\)\s]|\z)/' $i|grep -vE " "; done|sort|uniq >FULL_list
	$ grep -v -f OK_list FULL_list >excluded_list
	$ wc -l OK_list FULL_list excluded_list 
	  7398 OK_list
	 14487 FULL_list
	  7520 excluded_list
	 29405 total

There are a lot of constants that won't be parsed if we require at least
one '_'. Looking at the excluded_list, indeed there are things there
which should be excluded, like EXAMPLE, EXAMPLES, EXCEPTION, WITH, WITHIN,
WITHOUT..., but also there are a several matches that are constants,
like KERNELRELEASE, DISCONTIGMEM, SIOCGIFINDEX, SIGALRM, SETREGSET,
CDROMREADCOOKED, CDROMREADRAW, ...

I doubt that there would be an easy way to handle such cases, as
a file with a ~7000 constants would be a maintenance nightmare.

---

In summary, my suggestion is that we should stay out of having a big list
of constants. So, I would start with:

	1. a regex to get FOO_BAR cases, like:

		(^|\s|\()([\-]?[A-Z_][A-Z\d_]*_[A-Z\d_]*)([,.:;\)\s]|\z)

	   with a test to exclude this pattern:

		 ^\_+$

	2. a logic that will pick and use errno codes from the errno*.h
	   files;

	3. an specific handler handler for the NULL special case.

This would get 7000+ different constants, which seems a very good
start. If needed, later we could get a few other relevant constants
by checking the most-used ones with something like:

	for i in $(find Documentation/ -name '*.rst'); do perl -ne 'print "$2\n" if /(^|\s|\()([\-]?[A-Z_][A-Z\d_]*)([,.:;\)\s]|\z)/' $i|grep -v "_"; done|sort|uniq -c|sort -n

in order to grab the most common ones.

Regards,
Mauro