lists.openwall.net
Open Source and information security mailing list archives
 
Date:	Thu, 13 Nov 2014 05:33:06 +0000
From:	"Elliott, Robert (Server Storage)" <Elliott@...com>
To:	Barto <mister.freeman@...oste.net>,
	Guenter Roeck <linux@...ck-us.net>,
	Bjorn Helgaas <bhelgaas@...gle.com>
CC:	"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
	"linux-scsi@...r.kernel.org" <linux-scsi@...r.kernel.org>,
	Joe Perches <joe@...ches.com>
Subject: RE: BUG in scsi_lib.c due to a bad commit



> -----Original Message-----
> From: linux-kernel-owner@...r.kernel.org [mailto:linux-kernel-
> owner@...r.kernel.org] On Behalf Of Barto
> Sent: Wednesday, November 12, 2014 9:28 PM
> To: Guenter Roeck; Bjorn Helgaas
> Cc: linux-kernel@...r.kernel.org; linux-scsi@...r.kernel.org; Joe
> Perches
> Subject: Re: BUG in scsi_lib.c due to a bad commit
> 
> reverting your commit 045065d8a300a37218c is a solution, but only a
> temporary one;
> 
> it's better to find out why your commit can cause a random hang at boot
> on some PC configurations.
> 
> --- a/drivers/scsi/scsi_lib.c
> +++ b/drivers/scsi/scsi_lib.c
> @@ -1774,7 +1774,7 @@ static void scsi_request_fn(struct request_queue *q)
> blk_requeue_request(q, req);
> atomic_dec(&sdev->device_busy);
> out_delay:
> - if (atomic_read(&sdev->device_busy) && !scsi_device_blocked(sdev))
> + if (!atomic_read(&sdev->device_busy) && !scsi_device_blocked(sdev))
> blk_delay_queue(q, SCSI_QUEUE_DELAY);
> }
> 
> perhaps the atomic_read() function doesn't work as expected in some
> rare circumstances; I have the same doubts about the blk_delay_queue()
> function

Were you running with scsi_mod.use_blk_mq=Y or =N?

device_busy is the active queue depth for the device (e.g.
5 means there are 5 commands submitted but not yet completed).
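To make the queue-depth definition concrete, here is a minimal user-space sketch. It assumes C11 `<stdatomic.h>` operations as stand-ins for the kernel's atomic_t operations on device_busy; the `fake_sdev` struct and all helper names are hypothetical, not kernel code.

```c
#include <stdatomic.h>

/* Hypothetical stand-in for the device_busy field of struct scsi_device:
 * the number of commands submitted to the device but not yet completed. */
struct fake_sdev {
    atomic_int device_busy;
};

/* Dispatch path: one more command is outstanding. */
static void submit_cmd(struct fake_sdev *sdev)
{
    atomic_fetch_add(&sdev->device_busy, 1);
}

/* Completion path: one outstanding command has finished. */
static void complete_cmd(struct fake_sdev *sdev)
{
    atomic_fetch_sub(&sdev->device_busy, 1);
}

/* Current active queue depth (the kernel reads it with atomic_read()). */
static int active_queue_depth(struct fake_sdev *sdev)
{
    return atomic_load(&sdev->device_busy);
}
```

Submitting seven commands and completing two leaves active_queue_depth() at 5, matching the example above.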

The function reaches this code if it has run out of tags, the host
has reached its limit of outstanding commands, or the target has
reached its limit.  It requeues the request:
* with delay if device_busy is zero
* without delay if device_busy is non-zero
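The two cases reduce to a one-line predicate. This is a simplified sketch, not the kernel source: `busy` stands for the value returned by atomic_read(&sdev->device_busy), `blocked` stands for scsi_device_blocked(sdev), and both function names are hypothetical. The inverted variant mirrors the condition that commit 045065d8a300a37218c replaced.

```c
#include <stdbool.h>

/* Condition after the fix: schedule a delayed queue run only when
 * nothing is in flight and the device is not blocked. */
static bool needs_delayed_rerun(int busy, bool blocked)
{
    return busy == 0 && !blocked;
}

/* Inverted condition from before the fix, kept for comparison:
 * it delays only when commands are already in flight. */
static bool needs_delayed_rerun_inverted(int busy, bool blocked)
{
    return busy != 0 && !blocked;
}
```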

I think this is the reasoning:
If device_busy is zero, trying to process the request again will
probably run into the same problem; a delay gives time for the
situation to change.  If device_busy is non-zero, then the 
requeued command goes behind others and might get a different
result.
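This reasoning can be checked with a toy tick-based model. It rests on an assumption not spelled out above: a completing command re-runs the queue, so with commands in flight the requeued request gets retried without any timer, while with nothing in flight only a delayed run can restart it. Everything here is a hypothetical simplification: the function names, the `clear_at` tick at which the blocking condition goes away, and the rule that one in-flight command completes per tick.

```c
#include <stdbool.h>

typedef bool (*delay_policy)(int busy);

static bool delay_if_idle(int busy) { return busy == 0; }  /* fixed polarity */
static bool delay_if_busy(int busy) { return busy != 0; }  /* inverted */

/* Returns the tick at which the requeued request is dispatched, or -1
 * if nothing re-runs the queue before max_ticks (a stall). */
static int ticks_until_dispatch(delay_policy should_delay, int in_flight,
                                int clear_at, int delay, int max_ticks)
{
    int next_delayed_run = -1;          /* -1: no delayed run pending */

    /* Tick 0: the first attempt hits the limit and the request is requeued. */
    if (should_delay(in_flight))
        next_delayed_run = delay;

    for (int t = 1; t <= max_ticks; t++) {
        bool rerun = false;

        if (in_flight > 0) {            /* one in-flight command completes */
            in_flight--;
            rerun = true;               /* assumed: completion re-runs the queue */
        }
        if (t == next_delayed_run) {    /* the delayed run fires */
            next_delayed_run = -1;
            rerun = true;
        }
        if (!rerun)
            continue;                   /* nobody looks at the queue this tick */
        if (t >= clear_at)
            return t;                   /* limit cleared: request dispatched */

        /* Still blocked: requeue again and re-apply the policy. */
        if (should_delay(in_flight) && next_delayed_run < 0)
            next_delayed_run = t + delay;
    }
    return -1;
}
```

With nothing in flight, clear_at = 10 and delay = 3, ticks_until_dispatch(delay_if_idle, 0, 10, 3, 100) dispatches at tick 12, the first delayed run after the condition clears, whereas the inverted policy never re-runs the queue and returns -1, which resembles the boot hang discussed in this thread.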

With the polarity backwards, the lack of a delay hung PA-RISC
and SPARC64 systems, not just QEMU.  So I don't think reverting
the fix is a good idea.

Changing it to an unconditional delay, i.e. delaying regardless of
device_busy until the root cause is understood, might be safe.

Also, SCSI_QUEUE_DELAY seems like an arbitrary magic number; 
maybe that value isn't working correctly anymore?

