lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20200817184753.GA120209@rowland.harvard.edu>
Date:   Mon, 17 Aug 2020 14:47:53 -0400
From:   Alan Stern <stern@...land.harvard.edu>
To:     Jim Baxter <jim_baxter@...tor.com>
Cc:     Greg KH <gregkh@...uxfoundation.org>, linux-kernel@...r.kernel.org,
        linux-mm@...ck.org, linux-usb@...r.kernel.org,
        "Resch Carsten \(CM/ESO6\)" <Carsten.Resch@...bosch.com>,
        "Rosca, Eugeniu \(ADITG/ESB\)" <erosca@...adit-jv.com>
Subject: Re: PROBLEM: Long Workqueue delays.

On Mon, Aug 17, 2020 at 07:24:01PM +0100, Jim Baxter wrote:
> On 17/08/2020 12:57, Greg KH wrote:
> > On Mon, Aug 17, 2020 at 12:40:03PM +0100, Jim Baxter wrote:
> >> We have issues with the workqueue of the kernel overloading the CPU 0 
> >> when we we disconnect a USB stick.
> >>
> >> This results in other items on the shared workqueue being delayed by
> >> around 6.5 seconds with a default kernel configuration and 2.3 seconds
> >> on a config tailored for our RCar embedded platform.
> >>
> > 
> > Is this data really flushed out to the device?

In the example, data was copied _from_ the device, not to it.

> I am unsure if the delay is due to a single system or a combination of memory,
> usb or filesystem operations, the delay also occurs if the device is mounted
> as ro and using the sync option.
> 
> Using umount stops the issue occurring but is unfortunately not guaranteed in
> our particular system.

Unplugging a R/W USB drive without unmounting it first is a great way to 
corrupt the data.

> >> - Disconnect the USB stick:
> >> [ 1551.796792] usb 2-1: USB disconnect, device number 2
> >> [ 1558.625517] DELAY: 6782
> >>
> >>
> >> The Delay output 6782 is in milliseconds.
> > 
> > What USB workqueue is taking so long?> 
> > The one trying to deal with the filesystem flushing out the data that it
> > can't do now that the device is removed?  :)
> > 
> From my analysis it is the hub_event workqueue shown to be using most of the CPU,
> the kworker/0:1+usb thread uses around 98% of the CPU.
> 
> I have traced the workqueue:workqueue_queue_work function while unplugging the USB
> but not found a particular workqueue function being called a lot.
> 
> Using perf Iidentified the hub_events workqueue was spending a lot of time in
> invalidate_partition(), I have included a cut down the captured data from perf in
> [2] which shows the additional functions where the kworker spends most of its time.

invalidate_partition() is part of the block layer, not part of USB.  It 
gets called whenever a drive is removed from the system, no matter what 
type of drive it is.  You should ask the people involved in that 
subsystem why it takes so long.

> I realise that not unmounting the USB stick is not ideal, though I wonder what 
> additional work is done when unplugging the USB stick compared to unmounting it.

Unmounting a drive flushes all the dirty buffers from memory back to the 
drive.  Obviously that can't be done if the drive is unplugged first.

As far as the USB subsystem is concerned, exactly the same amount of 
work is done during disconnect regardless of whether or not the drive is 
mounted.  (In fact, the USB subsystem doesn't even know whether a drive 
is mounted; that concept is part of the block and filesystem layers.)

> I guess it may be waiting for a time-out during the operation without the unmount.

That seems very unlikely.  When a USB device gets unplugged the system 
realizes it.  Any I/O meant for that device is immediately cancelled; 
there are no timeouts.

(Okay, not strictly true; there is a fraction-of-a-second timeout during 
which the system waits to see whether the disconnect was permanent or 
just a temporary glitch.  But you're talking about 6-second long 
delays.)

Alan Stern

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ