linux-kernel - Re: [RFC PATCH v1 2/6] kernel-doc: replace kernel-doc perl parser with a pure python one (WIP)

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-Id: <8B5A4B93-5497-4BEE-8A88-51A7C0E75E32@darmarit.de>
Date:   Wed, 25 Jan 2017 20:07:47 +0100
From:   Markus Heiser <markus.heiser@...marit.de>
To:     Jani Nikula <jani.nikula@...el.com>
Cc:     Jonathan Corbet <corbet@....net>,
        Mauro Carvalho Chehab <mchehab@...radead.org>,
        Daniel Vetter <daniel.vetter@...ll.ch>,
        Matthew Wilcox <mawilcox@...rosoft.com>,
        "linux-doc \@ vger . kernel . org List" <linux-doc@...r.kernel.org>,
        "linux-kernel \@ vger . kernel . org List" 
        <linux-kernel@...r.kernel.org>
Subject: Re: [RFC PATCH v1 2/6] kernel-doc: replace kernel-doc perl parser with a pure python one (WIP)


Am 25.01.2017 um 11:24 schrieb Jani Nikula <jani.nikula@...el.com>:

> Markus, thanks for your work on this.

Thanks for your comments!

> Excuse me for my bluntness, but I think changing everything in a single
> commit, or even a few commits, is strictly not acceptable.

OK, I understand.

> When I changed *small* things in scripts/kernel-doc, I would make
> htmldocs before and after the change, and recursively diff the produced
> output to ensure there were no surprises. We already have enough
> documentation that a manual eyeballing of the output is simply not
> sufficient to ensure things don't break.

> The diff in output between before and after this series? 160k lines of
> unified diff without context ('diff -u0 -r old new | wc -l').
> 
> Many of the changes are improvements on the result, such as using proper
> <div> tags for function parameter lists etc., but clearly changing the
> output should be independent of changing the parser, so we have some
> chance of validating the parser.


Hmm ... I try to sort my thoughts on this:

The both parser are generating reST output. We have tested reST output
so it should be enough to compare the reST from the old one with the
new one ... at least theoretical.

But the problem I see here is, that the perl script generates a
reST output which I can't use. As an example we can take a look at
the man-page builder I shipped in the series.

 https://www.mail-archive.com/linux-doc@vger.kernel.org/msg09017.html

In the commit message there is a small example:

  <section docname="basics ...>
    <title>get_sd_load_idx</title>
    <kernel_doc_man manpage="get_sd_load_idx.9"/>
    ...
    <desc desctype="function" domain="c"....>
      <desc_signature ....>
        <desc_type>int</desc_type>
        ...

You see that it has <section> tag with childs <title/>, <kernel_doc_man/> 
and so on. This structured markup is used by builders, they navigate through
the structured tree picking up nodes and spit out some man-page html, or
whatever builder it is.

ATM the perl parser generates a reST output which does not have such
a structured tree, so the builder can't navigate in.

So, what I mean is, the new parser has to generate a complete different reST
output and thats why we can't compare the perl parser with python one on a reST
basis ... and if reST is different, HTML is different :(

So we do not have any chance to track regression when switching from
the old to the new parser.

Thats are my thoughts on this topic, may be you have a solution for this?

> 
>>   Ideally at the time of merging, we would be able to build the docs with
>>   *either* kerneldoc.
> 
> I'd be fine with switching over in a single commit that doesn't
> drastically change the output.

One solution might be to improve the reST output of the perl script
first, so that it produce something which has a structure and we
all can agree on (short: reST output is the reference, ATM the reference
need some improvements)

If this is a way we like to go, I can send a patch for the perl script,
so that we can commit one a reST reference.

> A drop-in replacement. But that's not the
> case here.
> 
>> - I'll have to try it out to see how noisy it is.  I'm not opposed to
>>   stricter checks; indeed, they could be a good thing.  But we might want
>>   to have an option so we can cut back on the noise by default.
> 
> The increase in 'make htmldocs' build log was from 1521 to 2791 lines in
> my tree. Arguably there was useful extra diagnosis, but some of it was
> the printouts of long lists of definitions that were not found, one per
> line. So it could be condensed without losing info too.

Yes, this was just a 1:1 merge from my POC, there are a lot of things
which could be meld down. ATM, for me it is important to get a feedback 
on the functionalities and concepts of kernel-doc apps (RFC).

> On to performance. With the default build options the new system was
> noticeably slower than the current one, with a 50% increase on my
> machine. But what really caught me by surprise was that passing
> SPHINXOPTS=-j5 to parallelize worked better on the current system,
> making the new one a whopping 70% slower. Of course, the argument is
> that the proposed parser does more and is better, but due to the
> monolithic change it's impossible to pinpoint the culprit or do a proper
> cost/benefit analysis on this. Again, this calls for a more broken down
> series of patches to make the changes.

Ups, I have to look closer ... I thought the py-solution is faster 
since it does not for processes and does some caching.

> Finally, while I'd love to see scripts/kernel-doc go, I do have to ask
> if changing roughly 3k lines of Perl to roughly 3k lines of Python (*)
> really makes everything better? They both still parse everything using a
> large pile of regular expressions and a clunky state machine. When I
> look at the code, I'm afraid I do not get that liberating feeling of
> throwing out old junk in favor of something small or elegant or even
> obviously more maintainable than the old one. The new one offers more
> features, but repeatedly we face the problem that it's all lumped in
> together with the parser change. We should be able to look at the parser
> change and the other improvements separately.
> 
> That said, perhaps having an elegant parser (perhaps based on a compiler
> plugin) is incompatible with the idea of making it a bug-for-bug drop-in
> replacement of the old one, and it's something we need to think about.

Before I started implementing the parser I thought about separating
parsing from generating reST. I played a bit with pycparser

  https://github.com/eliben/pycparser

but I realized that the coverage of those parser might be not
enough for the kernel sources. At this time you mentioned sparse.
I haven't had time to at sparse but I guess that this is the
tool.

-- Markus --


> All in all I think the message should be clear: this needs to be split
> into small, incremental changes. Just like we do everything in the
> kernel.
> 
> 
> BR,
> Jani.
> 
> 
> (*) Please do not get hung up on these numbers. The Python version does
>    more in some ways, but adds more deps such as fspath that's not
>    included in the figures, and the Perl version outputs more
>    formats. It's not an apples to apples comparison. Let's just say
>    they are somewhere in the same ballpark.
> 
> -- 
> Jani Nikula, Intel Open Source Technology Center