linux-kernel - Re: [PATCHv9 00/12] PCI: Recode Mobiveil driver and add PCIe Gen4 driver for NXP Layerscape SoCs

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite for Android: free password hash cracker in your pocket
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAOesGMjJS0SfNwQoBqL8Y1G4Uj0YDBf+EWP4MHCnVWnZF2DyyA@mail.gmail.com>
Date:   Mon, 10 Feb 2020 19:33:19 +0100
From:   Olof Johansson <olof@...om.net>
To:     Russell King - ARM Linux admin <linux@...linux.org.uk>
Cc:     "mark.rutland@....com" <mark.rutland@....com>,
        "devicetree@...r.kernel.org" <devicetree@...r.kernel.org>,
        Lorenzo Pieralisi <lorenzo.pieralisi@....com>,
        "arnd@...db.de" <arnd@...db.de>,
        "m.karthikeyan@...iveil.co.in" <m.karthikeyan@...iveil.co.in>,
        "linux-pci@...r.kernel.org" <linux-pci@...r.kernel.org>,
        "Z.q. Hou" <zhiqiang.hou@....com>,
        "l.subrahmanya@...iveil.co.in" <l.subrahmanya@...iveil.co.in>,
        "will.deacon@....com" <will.deacon@....com>,
        "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
        Leo Li <leoyang.li@....com>,
        "M.h. Lian" <minghuan.lian@....com>,
        "robh+dt@...nel.org" <robh+dt@...nel.org>,
        Xiaowei Bao <xiaowei.bao@....com>,
        "catalin.marinas@....com" <catalin.marinas@....com>,
        "bhelgaas@...gle.com" <bhelgaas@...gle.com>,
        "andrew.murray@....com" <andrew.murray@....com>,
        "shawnguo@...nel.org" <shawnguo@...nel.org>,
        Mingkai Hu <mingkai.hu@....com>,
        "linux-arm-kernel@...ts.infradead.org" 
        <linux-arm-kernel@...ts.infradead.org>,
        honeycomb-users@...ts.infradead.org
Subject: Re: [PATCHv9 00/12] PCI: Recode Mobiveil driver and add PCIe Gen4
 driver for NXP Layerscape SoCs

[cc:ing honeycomb-users, didn't think of that earlier]

On Mon, Feb 10, 2020 at 5:16 PM Russell King - ARM Linux admin
<linux@...linux.org.uk> wrote:
>
> On Mon, Feb 10, 2020 at 04:28:23PM +0100, Olof Johansson wrote:
> > On Mon, Feb 10, 2020 at 4:23 PM Russell King - ARM Linux admin
> > <linux@...linux.org.uk> wrote:
> > >
> > > On Mon, Feb 10, 2020 at 04:12:30PM +0100, Olof Johansson wrote:
> > > > On Thu, Feb 6, 2020 at 11:57 AM Z.q. Hou <zhiqiang.hou@....com> wrote:
> > > > >
> > > > > Hi Olof,
> > > > >
> > > > > Thanks a lot for your comments!
> > > > > And sorry for my delay respond!
> > > >
> > > > Actually, they apply with only minor conflicts on top of current -next.
> > > >
> > > > Bjorn, any chance we can get you to pick these up pretty soon? They
> > > > enable full use of a promising ARM developer system, the SolidRun
> > > > HoneyComb, and would be quite valuable for me and others to be able to
> > > > use with mainline or -next without any additional patches applied --
> > > > which this patchset achieves.
> > > >
> > > > I know there are pending revisions based on feedback. I'll leave it up
> > > > to you and others to determine if that can be done with incremental
> > > > patches on top, or if it should be fixed before the initial patchset
> > > > is applied. But all in all, it's holding up adaption by me and surely
> > > > others of a very interesting platform -- I'm looking to replace my
> > > > aging MacchiatoBin with one of these and would need PCIe/NVMe to work
> > > > before I do.
> > >
> > > If you're going to be using NVMe, make sure you use a power-fail safe
> > > version; I've already had one instance where ext4 failed to mount
> > > because of a corrupted journal using an XPG SX8200 after the Honeycomb
> > > Serror'd, and then I powered it down after a few hours before later
> > > booting it back up.
> > >
> > > EXT4-fs (nvme0n1p2): INFO: recovery required on readonly filesystem
> > > EXT4-fs (nvme0n1p2): write access will be enabled during recovery
> > > JBD2: journal transaction 80849 on nvme0n1p2-8 is corrupt.
> > > EXT4-fs (nvme0n1p2): error loading journal
> >
> > Hmm, using btrfs on mine, not sure if the exposure is similar or not.
>
> As I understand the problem, it isn't a filesystem issue.  It's a data
> integrity issue with the NVMe over power fail, how they cache the data,
> and ultimately write it to the nand flash.
>
> Have a read of:
>
> https://www.kingston.com/en/solutions/servers-data-centers/ssd-power-loss-protection
>
> As NVMe and SSD are basically the same underlying technology (the host
> interface is different) and the issues I've heard, and now experienced
> with my NVMe, I think the above is a good pointer to the problems of
> flash mass storage.
>
> As I understand it, the problem occurs when the mapping table has not
> been written back to flash, power is lost without the Standby Immediate
> command being sent, and there is no way for the firmware to quickly
> save the table.  On subsequent power up, the firmware has to
> reconstruct the mapping table, and depending on how that is done,
> incorrect (old?) data may be returned for some blocks.
>
> That can happen to any blocks on the drive, which means any data can
> be at risk from a power loss event, whether that is a power failure
> or after a crash.

Makes me suspect if there's some board-level power/reset sequencing
issue, or if there's a problem with one card going down disabling
others. I haven't read the specs enough to know what's expected
behavior but I've seen similar issues on other platforms so take it
with a grain of salt.

> > Do you know if the SErr was due to a known issue and/or if it's
> > something that's fixed in production silicon?
>
> The SError is triggered by something on the PCIe side of things; if I
> leave the Mellanox PCIe card out, then I don't get them.  The errata
> patches I have merged into my tree help a bit, turning the code from
> being unable to boot without a SError with the card plugged in, to
> being able to boot and last a while - but the SErrors still eventually
> come, maybe taking a few days... and that's without the Mellanox
> ethernet interface being up.
>
> > (I still can't enable SMMU since across a warm reboot it fails
> > *completely*, with nothing coming up and working. NXP folks, you
> > listening? :)
>
> Is it just a warm reboot?  I thought I saw SMMU activity on a cold
> boot as well, implying that there were devices active that Linux
> did not know about.

Yeah, 100% reproducible on warm reboot -- every single time. Not on
cold boot though (100% success rate as far as I remember). I boot with
kernel on NVMe on PCIe, native 1GbE for networking. u-boot from SD
card.

This is with the SolidRun u-boot from GitHub.


-Olof