Date:   Wed, 3 Mar 2021 00:22:36 +0000
From:   Sergei Trofimovich <slyich@...il.com>
To:     John Paul Adrian Glaubitz <glaubitz@...sik.fu-berlin.de>,
        Don Brace <don.brace@...rochip.com>
Cc:     linux-ia64@...r.kernel.org, linux-kernel@...r.kernel.org,
        Joe Szczypek <jszczype@...hat.com>,
        Scott Benesh <scott.benesh@...rochip.com>,
        Scott Teel <scott.teel@...rochip.com>,
        Tomas Henzl <thenzl@...hat.com>,
        "Martin K. Petersen" <martin.petersen@...cle.com>
Subject: [bisected] 5.12-rc1 hpsa regression: "scsi: hpsa: Correct dev cmds
 outstanding for retried cmds" breaks hpsa P600

On Tue, 2 Mar 2021 23:31:32 +0100
John Paul Adrian Glaubitz <glaubitz@...sik.fu-berlin.de> wrote:

> Hi Sergei!
> 
> On 3/2/21 11:26 PM, Sergei Trofimovich wrote:
> > Gave v5.12-rc1 a try today and got a similar boot failure around
> > hpsa queue initialization, but my failure is later:
> >     https://dev.gentoo.org/~slyfox/configs/guppy-dmesg-5.12-rc1
> > Maybe I get a different error because I flipped on most debugging
> > kernel options :)
> > 
> > Looks like the 'ERROR: Invalid distance value range' messages, while
> > very scary, are harmless. It's just a new spammy way for the kernel
> > to report the lack of NUMA config on the machine (no SRAT and SLIT
> > ACPI tables).
> > 
> > At least I get hpsa detected on the PCI bus. But I guess its discovered
> > configuration is very wrong as I get unaligned accesses:
> >     [   19.811570] kernel unaligned access to 0xe000000105dd8295, ip=0xa000000100b874d1
> > 
> > Bisecting now.  
> 
> Sounds good. I guess we should get Jens' fix for the signal regression
> merged as well as your two fixes for strace.

"bisected" (cheated halfway through) and verified that reverting
f749d8b7a9896bc6e5ffe104cc64345037e0b152 makes rx3600 boot again.

CCing authors who might be able to help us here.

commit f749d8b7a9896bc6e5ffe104cc64345037e0b152
Author: Don Brace <don.brace@...rochip.com>
Date:   Mon Feb 15 16:26:57 2021 -0600

    scsi: hpsa: Correct dev cmds outstanding for retried cmds

    Prevent incrementing device->commands_outstanding for ioaccel command
    retries that are driver initiated.  If the command goes through the retry
    path, the device->commands_outstanding counter has already accounted for
    the number of commands outstanding to the device.  Only commands going
    through function hpsa_cmd_resolve_events decrement this counter.

     - ioaccel commands go to either HBA disks or to logical volumes comprised
       of SSDs.

    The extra increment is causing device resets to hang.

     - Resets wait for all device outstanding commands to complete before
       returning.

    Replace unused field abort_pending with retry_pending. This is a
    maintenance driver so these changes have the least impact/risk.

    Link: https://lore.kernel.org/r/161342801747.29388.13045495968308188518.stgit@brunhilda
    Tested-by: Joe Szczypek <jszczype@...hat.com>
    Reviewed-by: Scott Benesh <scott.benesh@...rochip.com>
    Reviewed-by: Scott Teel <scott.teel@...rochip.com>
    Reviewed-by: Tomas Henzl <thenzl@...hat.com>
    Signed-off-by: Don Brace <don.brace@...rochip.com>
    Signed-off-by: Martin K. Petersen <martin.petersen@...cle.com>
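
For readers who do not know the driver, the accounting described in the commit message can be sketched as a toy model. This is not the hpsa code: only the field names commands_outstanding and retry_pending come from the commit message, while the toy_* types and functions are hypothetical and assume a single-threaded caller (the real driver would need atomic counters).

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of the accounting; hypothetical, not the hpsa driver. */
struct toy_device {
    int commands_outstanding;   /* commands currently owned by the device */
};

struct toy_cmd {
    struct toy_device *dev;
    bool retry_pending;         /* set when the driver itself resubmits */
};

/* Submit path: count the command only on its first submission. */
static void toy_submit(struct toy_cmd *c)
{
    if (!c->retry_pending)
        c->dev->commands_outstanding++;
    c->retry_pending = false;
}

/* Driver-initiated retry: mark the command, then resubmit it. */
static void toy_retry(struct toy_cmd *c)
{
    c->retry_pending = true;
    toy_submit(c);
}

/* Completion path: the only place where the counter goes down. */
static void toy_complete(struct toy_cmd *c)
{
    assert(c->dev->commands_outstanding > 0);
    c->dev->commands_outstanding--;
}
```

In this model a submit, a retry and a completion leave the counter at zero. Without the retry_pending check the retry would count the command twice, the single completion would leave the counter at one, and anything waiting for it to reach zero (such as a reset) would wait forever.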

Don, do you happen to know why this patch causes a controller init failure
on this device?
    14:01.0 RAID bus controller: Hewlett-Packard Company Smart Array P600

Boot failure: https://dev.gentoo.org/~slyfox/configs/guppy-dmesg-5.12-rc1
Boot success: https://dev.gentoo.org/~slyfox/configs/guppy-dmesg-5.12-rc1-good

The only difference between the two boots is that the -good case has
f749d8b7a9896bc6e5ffe104cc64345037e0b152 reverted on top of 5.12-rc1.

It looks like the hpsa controller fails to initialize in the bad case
(could it be a race?).
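
As a side note on the unaligned access message quoted earlier, the reported address 0xe000000105dd8295 is odd, so it is misaligned for any access wider than one byte. A minimal check, where is_aligned is a hypothetical helper and not a kernel function:

```c
#include <stdbool.h>
#include <stdint.h>

/* True when addr is a multiple of size; size must be a power of two. */
static bool is_aligned(uint64_t addr, uint64_t size)
{
    return (addr & (size - 1)) == 0;
}
```

Calling is_aligned(0xe000000105dd8295ULL, 2) returns false, and the same holds for sizes 4 and 8. This says nothing about which structure the driver was reading, only that the pointer itself was not naturally aligned.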

-- 

  Sergei
