[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <YzGkD5LvFmZIOtnx@wendy>
Date: Mon, 26 Sep 2022 14:07:27 +0100
From: Conor Dooley <conor.dooley@...rochip.com>
To: Petr Mladek <pmladek@...e.com>
CC: Conor Dooley <conor@...nel.org>, <senozhatsky@...omium.org>,
<rostedt@...dmis.org>, <john.ogness@...utronix.de>,
<linux-kernel@...r.kernel.org>
Subject: Re: [resend][bug] low-probability console lockups since 5.19
On Mon, Sep 26, 2022 at 12:32:55PM +0200, Petr Mladek wrote:
> On Fri 2022-09-23 17:24:17, Conor Dooley wrote:
> > Hey all,
> >
> > Been bisecting a bug that is causing a boot failure in my CI & have
> > ended up here.. The bug in question is a low(ish) probability lock up
> > of the serial console, I would estimate about 1-in-5 chance on the
> > boards I could actually trigger it on which it has taken me so long
> > to realise that this was an actual problem. Thinking back on it, there
> > were other failures that I would retroactively attribute to this
> > problem too, but I had earlycon disabled and I chalked the hangs at
> > "Starting kernel" down to pre-release hardware since the other boots
> > in a CI job would not fail (we boot the same config multiple times),
> > Silly me I guess!
> >
> > In a boot failure (with earlycon & keep_bootcon enabled) I see the
> > likes of [0]:
> > [ 0.485063] NFS: Registering the id_resolver key type
> > [ 0.491051] Key type id_resolver registered
> > [ 0.495743] Key type id_legacy registered
> > [ 0.500528] nfs4filelayout_init: NFSv4 File Layout Driver Registering...
> > [ 0.508010] nfs4flexfilelayout_init: NFSv4 Flexfile Layout Driver Registering...
> > [ 0.516857] 9p: Installing v9fs 9p2000 file system support
> > [ 0.524032] NET: Registered PF_ALG protocol family
> > [ 0.529562] Block layer SCSI generic (bsg) driver version 0.4 loaded (major 247)
> > [ 0.537817] io scheduler mq-deadline registered
> > [ 0.542889] io scheduler kyber registered
> > [ 0.559682] microchip-pcie 2000000000.pcie: host bridge /soc/pcie@...0000000 ranges:
> > [ 0.568431] microchip-pcie 2000000000.pcie: MEM 0x2008000000..0x2087ffffff -> 0x0008000000
> > [ 0.635633] microchip-pcie 2000000000.pcie: axi read request error
> > [ 0.642602] microchip-pcie 2000000000.pcie: axi read timeout
> > [ 0.776630] microchip-pcie 2000000000.pcie: sec error in axi2pcie buffer
> > [ 0.784133] microchip-pcie 2000000000.pcie: axi write request error
> > [ 0.791100] microchip-pcie 2000000000.pcie: sec error in axi2pcie buffer
> > [ 0.798543] microchip-pcie 2000000000.pcie: sec error in axi2pcie buffer
> > [ 0.805981] microchip-pcie 2000000000.pcie: sec error in axi2pcie buffer
> > [ 0.813421] microchip-pcie 2000000000.pcie: sec error in axi2pcie buffer
> > [ 0.820859] microchip-pcie 2000000000.pcie: sec error in axi2pcie buffer
> > [ 0.828295] microchip-pcie 2000000000.pcie: sec error in axi2pcie buffer
> > [ 1.579668] Freeing initrd memory: 7316K
> >
> > The sec error stuff is unrelated & happens on good boots too - although
> > there are far fewer lines of "sec error" there [1] (excuse the message
> > duplication from the bootcon).
> >
> > I was able to bisect the behaviour to this set of reverts, but I have
> > low confidence as to which commit specifically caused it - it's hard
> > to know that doing 20+ reboots on every commmit I built is enough to
> > make sure I did not make a mistake. I had initially started the
> > bisection between v6.0-rc6 & v6.0-rc1, and that lead me down a sub-tree
> > of net which is why the bisect log starts at a "random" commit as that
> > merge commit was the one blamed by the inital bisection.
> >
> > git bisect start
> > # status: waiting for both good and bad commits
> > # bad: [5e8379351dbde61ea383e514f0f9ecb2c047cf4e] Merge tag 'net-5.19-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
> > git bisect bad 5e8379351dbde61ea383e514f0f9ecb2c047cf4e
> > # good: [4b0986a3613c92f4ec1bdc7f60ec66fea135991f] Linux 5.18
> > git bisect good 4b0986a3613c92f4ec1bdc7f60ec66fea135991f
> > # good: [fbe86daca0ba878b04fa241b85e26e54d17d4229] Merge tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi
> > git bisect good fbe86daca0ba878b04fa241b85e26e54d17d4229
> > # good: [2b8c612c6102f751e6e3e1bd425f64e9d3d3f638] kernel/reboot: Fix powering off using a non-syscall code paths
> > git bisect good 2b8c612c6102f751e6e3e1bd425f64e9d3d3f638
> > # good: [54c2cc79194c961a213c1d375fe3aa4165664cc4] Merge tag 'usb-5.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb
> > git bisect good 54c2cc79194c961a213c1d375fe3aa4165664cc4
> > # skip: [8f7ac50c97d30ae5ae48da3b516510d05a67b9e5] Merge tag 'sound-5.19-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound
> > git bisect skip 8f7ac50c97d30ae5ae48da3b516510d05a67b9e5
> > # good: [cdd85786f4b3b9273e4376e69aa95a2d71722764] x86/tdx: Clarify RIP adjustments in #VE handler
> > git bisect good cdd85786f4b3b9273e4376e69aa95a2d71722764
> > # good: [8cc683833726912498130a0130fc3bd0d169ef59] Merge branch 'net-dp83822-fix-interrupt-floods'
> > git bisect good 8cc683833726912498130a0130fc3bd0d169ef59
> > # bad: [0840a7914caa14315a3191178a9f72c742477860] Merge tag 'char-misc-5.19-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc
> > git bisect bad 0840a7914caa14315a3191178a9f72c742477860
> > # bad: [38bc4ac431684498126f9baa3a530e5a132f0173] Merge tag 'drm-fixes-2022-06-24' of git://anongit.freedesktop.org/drm/drm
> > git bisect bad 38bc4ac431684498126f9baa3a530e5a132f0173
> > # good: [399bd66e219e331976fe6fa6ab81a023c0c97870] Merge tag 'net-5.19-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
> > git bisect good 399bd66e219e331976fe6fa6ab81a023c0c97870
> > # bad: [a237cfd6b7469d6f5eeaa45f30128ab78b5281a5] Merge tag 'block-5.19-2022-06-24' of git://git.kernel.dk/linux-block
> > git bisect bad a237cfd6b7469d6f5eeaa45f30128ab78b5281a5
> > # good: [92f20ff72066d8d7e2ffb655c2236259ac9d1c5d] Merge tag 'pm-5.19-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
> > git bisect good 92f20ff72066d8d7e2ffb655c2236259ac9d1c5d
> > # bad: [598f2404879336277a4320ac5000394b873e049a] Merge tag 'io_uring-5.19-2022-06-24' of git://git.kernel.dk/linux-block
> > git bisect bad 598f2404879336277a4320ac5000394b873e049a
> > # good: [386e4fb6962b9f248a80f8870aea0870ca603e89] io_uring: use original request task for inflight tracking
> > git bisect good 386e4fb6962b9f248a80f8870aea0870ca603e89
> > # good: [2d9ef940f89e0ab4fde7ba6f769d82f2a450c35a] Revert "printk: extend console_lock for per-console locking"
> > git bisect good 2d9ef940f89e0ab4fde7ba6f769d82f2a450c35a
> > # bad: [07a22b61946f0b80065b0ddcc703b715f84355f5] Revert "printk: add functions to prefer direct printing"
> > git bisect bad 07a22b61946f0b80065b0ddcc703b715f84355f5
> > # bad: [5831788afb17b89c5b531fb60cbd798613ccbb63] Revert "printk: add kthread console printers"
> > git bisect bad 5831788afb17b89c5b531fb60cbd798613ccbb63
> > # first bad commit: [5831788afb17b89c5b531fb60cbd798613ccbb63] Revert "printk: add kthread console printers"
> >
> > On 5831788afb17 ("Revert "printk: add kthread console printers"") or on
> > 07a22b61946f ("Revert "printk: add functions to prefer direct printing"")
> > I was not able to reproduce the exact error conditions that I got on
> > the current HEAD or any of the commits along my bisection, but instead I
> > got something different [2]:
>
> This kind of problems became visible with introducing printk console
> kthreads in 5.19-rc1. And it was actually the main reason why they
> were reverted in 5.19-rc4.
Right. I did notice that between rc1 & rc4 a lot of garbage messages
as two printers "competed".
> The mainline merge commits when they were introduced and reverted:
>
> + introduced: commit 537e62c865dcb9b91 ("Merge tag 'printk-for-5.19'
> of git://git.kernel.org/pub/scm/linux/kernel/git/printk/linux")
Prior to this my console behaved normally. After it, it started having
the "competition" I mentioned above. Unfortunately I did not save any
logs from this.
> + reverted: commit 9d882352bac8f2ff375 ("Merge tag 'printk-for-5.19-rc4'
> of git://git.kernel.org/pub/scm/linux/kernel/git/printk/linux
After this commit is where the lockups started, I cannot reproduce it
several dozens of reboots prior to v5.19-rc1. Maybe the behaviour was
already there, but I cannot trigger any misbehaviour prior to v5.19-rc1.
> The problem might happen when console write() callback is not properly
> serialized against other operations on the same port. It might usually
> happen even without the kthreads. But the kthreads significantly
> increased the chance to hit the problem.
>
> The best solution is to fix the console driver. So far, we
> noticed two typical problems:
>
> 1. Early console did not take port->lock. For example, see the commit
> 589f892ac8ef244e47c5a0 ("serial: meson: acquire port->lock in startup()").
If I don't have an early console, I lock up at "Starting kernel" & never
get any output. Does that rule out the early console?
>
> 2. Console driver expected that the write() callback would be called
> with IRQs disabled. For example, see the commit aabdbb1b7a5819e18
> ("serial: msm_serial: disable interrupts in __msm_console_write().
I'll take a look at these I suppose & the new threaded printer stuff,
and see if either makes a difference for me. Unfortunately, ~zero chance
that I get anothing done prior to the merge window, but this is broken
for me in v5.19 so I guess that makes no odds.
I guess we should also sic regzbot on this thread so it does not get
lost?
Thanks,
Conor.
Powered by blists - more mailing lists