linux-kernel - Re: Memory barrier needed with wake_up

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <87eg03mf69.fsf@linux.intel.com>
Date:   Mon, 16 Jan 2017 13:12:46 +0200
From:   Felipe Balbi <felipe.balbi@...ux.intel.com>
To:     Alan Stern <stern@...land.harvard.edu>
Cc:     Peter Zijlstra <peterz@...radead.org>,
        "Paul E. McKenney" <paulmck@...ux.vnet.ibm.com>,
        Ingo Molnar <mingo@...hat.com>,
        USB list <linux-usb@...r.kernel.org>,
        Kernel development list <linux-kernel@...r.kernel.org>,
        Will Deacon <will.deacon@....com>
Subject: Re: Memory barrier needed with wake_up_process()?


Hi,

Alan Stern <stern@...land.harvard.edu> writes:
> On Tue, 20 Sep 2016, Felipe Balbi wrote:
>
>> And here's trace output (complete, scroll to bottom). It seems to me
>> like the thread didn't wake up at all. It didn't even try to
>> execute. I'll add some more traces and try to get better information
>> about what's going on.
>> 
>> 
>> 
>> # tracer: nop
>> #
>> # entries-in-buffer/entries-written: 2865/2865   #P:4
>> #
>> #                              _-----=> irqs-off
>> #                             / _----=> need-resched
>> #                            | / _---=> hardirq/softirq
>> #                            || / _--=> preempt-depth
>> #                            ||| /     delay
>> #           TASK-PID   CPU#  ||||    TIMESTAMP  FUNCTION
>> #              | |       |   ||||       |         |
>
> Skipping to the end...
>
>>      irq/17-dwc3-2522  [002] d...    43.504199: bulk_out_complete: 0, 31/31
>>     file-storage-2521  [001] ....    43.504202: fsg_main_thread: get_next_command -> 0
>>     file-storage-2521  [001] ....    43.504288: fsg_main_thread: do_scsi_command -> 0
>>     file-storage-2521  [001] ....    43.504295: fsg_main_thread: finish_reply -> 0
>>     file-storage-2521  [001] ....    43.504298: fsg_main_thread: send_status -> 0
>>      irq/17-dwc3-2522  [002] d...    43.504347: bulk_in_complete: 0, 16384/16384
>>      irq/17-dwc3-2522  [002] d...    43.504351: bulk_in_complete: 0, 16384/16384
>>      irq/17-dwc3-2522  [002] d...    43.504434: bulk_in_complete: 0, 16384/16384
>>      irq/17-dwc3-2522  [002] d...    43.504438: bulk_in_complete: 0, 16384/16384
>>      irq/17-dwc3-2522  [002] d...    43.504535: bulk_in_complete: 0, 16384/16384
>>      irq/17-dwc3-2522  [002] d...    43.504539: bulk_in_complete: 0, 16384/16384
>>      irq/17-dwc3-2522  [002] d...    43.504618: bulk_in_complete: 0, 16384/16384
>>      irq/17-dwc3-2522  [002] d...    43.504703: bulk_in_complete: 0, 16384/16384
>>      irq/17-dwc3-2522  [002] d...    43.504794: bulk_in_complete: 0, 13/13
>>      irq/17-dwc3-2522  [002] d...    43.504797: bulk_out_complete: 0, 31/31
>
> Like you say, it appears that the thread didn't get woken up at all.
> But this is inconsistent with your earlier results.  On Sep. 9, you 
> posted a message that ended with these lines:
>
>>      irq/17-dwc3-3579  [003] d..1 21167.729666: bulk_out_complete: compl: bh ffff880111e6aac0 state 1
>>     file-storage-3578  [002] .... 21167.729670: fsg_main_thread: next: bh ffff880111e6aac0 state 1
>
> This indicates that in the earlier test, the thread did start running 
> and get_next_command should have returned.
>
> The trace you posted after this one doesn't seem to show anything new, 
> as far as I can tell.
>
> So I still can't tell what's happening.  Maybe the patch below will 
> help.  It concentrates on the critical area.

Sorry for the long delay, I finally have more information on this. All
this time I was doing something that I never considered to matter: I've
been running host and peripheral on the same machine. Now that I have
tracepoints on xHCI as well, I could see that these 30 seconds of
"nothing" is actuall full of xHCI activity and I can see that for the
duration of these 30 seconds preempt depth on the CPU that (eventually)
queues a request on dwc3, is always > 1 (sometimes 2, most of the time
1). My conclusion from that is that xHCI (or usbcore ?!?) locks the CPU
and g_mass_storage is spinning for over 30 seconds at which point
storage.ko (host side class driver) dequeues the request.

I'll see if I can capture a fresh trace with both xHCI and dwc3 with
this happening, but probably not today (testing stuff for -rc).

-- 
balbi

Download attachment "signature.asc" of type "application/pgp-signature" (833 bytes)