linux-kernel - Re: workqueue lockup due to process_unsol_events stuck in azx_rirb_get


Message-ID: <s5hwpdjjcgn.wl-tiwai@suse.de>
Date:   Wed, 25 Jan 2017 18:06:48 +0100
From:   Takashi Iwai <tiwai@...e.de>
To:     Vlastimil Babka <vbabka@...e.cz>
Cc:     Jaroslav Kysela <perex@...ex.cz>, alsa-devel@...a-project.org,
        LKML <linux-kernel@...r.kernel.org>
Subject: Re: workqueue lockup due to process_unsol_events stuck in azx_rirb_get_response

On Wed, 25 Jan 2017 18:03:38 +0100,
Vlastimil Babka wrote:
> 
> On 01/25/2017 03:54 PM, Takashi Iwai wrote:
> > On Wed, 25 Jan 2017 13:28:11 +0100,
> > Vlastimil Babka wrote:
> >>
> >> Hi,
> >>
> >> my desktop randomly experiences workqueue lockups on boot with
> >> openSUSE Tumbleweed kernels 4.9.x, installed around
> >> Christmas. Previously I had a (badly maintained) Gentoo installation
> >> with 4.4 IIRC, so I can't say if the kernel has regressed, or the
> >> major userspace changes exposed different timing of stuff.
> >
> > If the lockup can be reproduced easily, could you check whether the
> > old kernel shows the issue?  I don't remember any big changes in
> > the ca0132 driver in 4.x kernels.  It'd be helpful even just checking
> > an openSUSE Leap 42.1 or 42.2 kernel.
> >
> >> This is what the workqueue lockup looks like:
> > (snip)
> >> kernel:  [<ffffffffc0c20501>] dspio_read+0x51/0x70 [snd_hda_codec_ca0132]
> >> kernel:  [<ffffffffc0c20566>] ca0132_process_dsp_response+0x46/0x160 [snd_hda_codec_ca0132]
> >> kernel:  [<ffffffffc0c02fe5>] call_jack_callback.isra.1+0x25/0xa0 [snd_hda_codec]
> >> kernel:  [<ffffffffc0c033c6>] snd_hda_jack_unsol_event+0x66/0x80 [snd_hda_codec]
> >> kernel:  [<ffffffffc0bfd077>] hda_codec_unsol_event+0x17/0x20 [snd_hda_codec]
> >> kernel:  [<ffffffffc0b86193>] process_unsol_events+0x63/0x70 [snd_hda_core]
> >
> > This is the code path that runs when the codec chip (CA0132) receives
> > an unsolicited event with a specific tag (0x16).  It means DSP
> > communication is in progress.
> 
> Oh, so it is actually the unused Creative card after all. Wonder what
> "jack" event it processes, since no jack is plugged in...
> 
> > Possibly the bug is due to the recursive runtime PM handling.  Could
> > you check the patch below?
> 
> Hmm, so the issue didn't happen when rebooting with this patch on top
> of current kernel-source stable branch (i.e. 4.9.5). But then I did a
> full poweroff by mistake, and now I can't reproduce it even with the
> original kernel. Before the poweroff it persisted over each reboot
> today, so perhaps the card was in some specific state and now it's
> not... Might be also related to dual boot with Win10 and whatever its
> driver does to it and it persists over reboot? I'll keep using the
> nonpatched kernel until I hit the problem again and then try to test
> the patched kernel more times. Thanks so far!

The code path is related to runtime PM, so it likely depends on the
device state, e.g. a long pause or such.  I don't think Win 10 plays
a role, but who knows.

Anyway, let me know if this helps.  I can merge it right away in any
case, as the fix shouldn't cause a regression.  But of course it'd
be better to have a test result :)


thanks,

Takashi