linux-kernel - Re: [PATCH v7 1/2] PCI: iproc: Retry request when CRS returned from EP

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAMSpPPctfRhf8zkeoGN73N-vm8Adn7SwieAAeRKrjhY7m_RkEw@mail.gmail.com>
Date:   Wed, 23 Aug 2017 15:57:02 +0530
From:   Oza Oza <oza.oza@...adcom.com>
To:     Bjorn Helgaas <helgaas@...nel.org>
Cc:     Bjorn Helgaas <bhelgaas@...gle.com>,
        Rob Herring <robh+dt@...nel.org>,
        Mark Rutland <mark.rutland@....com>,
        Ray Jui <rjui@...adcom.com>,
        Scott Branden <sbranden@...adcom.com>,
        Jon Mason <jonmason@...adcom.com>,
        BCM Kernel Feedback <bcm-kernel-feedback-list@...adcom.com>,
        Andy Gospodarek <gospo@...adcom.com>,
        linux-pci <linux-pci@...r.kernel.org>,
        linux-kernel@...r.kernel.org,
        Oza Pawandeep <oza.pawandeep@...il.com>,
        Sinan Kaya <okaya@...eaurora.org>,
        Timur Tabi <timur@...eaurora.org>,
        Alex Williamson <alex.williamson@...hat.com>
Subject: Re: [PATCH v7 1/2] PCI: iproc: Retry request when CRS returned from EP

On Wed, Aug 23, 2017 at 2:03 AM, Bjorn Helgaas <helgaas@...nel.org> wrote:
> [+cc Sinan, Timur, Alex]
>
> Hi Oza,
>
> On Mon, Aug 21, 2017 at 09:28:43PM +0530, Oza Pawandeep wrote:
>> PCIe spec r3.1, sec 2.3.2
>> If CRS software visibility is not enabled, the RC must reissue the
>> config request as a new request.
>>
>> - If CRS software visibility is enabled,
>> - for a config read of Vendor ID, the RC must return 0x0001 data
>> - for all other config reads/writes, the RC must reissue the
>>   request
>
> You mentioned early on that this fixes an issue with NVMe after a
> reset.  Is that reset an FLR?  I'm wondering if this overlaps with the
> work Sinan is doing:
>
>   http://lkml.kernel.org/r/20170818212310.15145.21732.stgit@bhelgaas-glaptop.roam.corp.google.com
>
> With the current code in the tree, pci_flr_wait() probably waits only
> 100ms on your system.  It currently reads PCI_COMMAND and probably
> sees 0xffff0001 because the NVMe device returns a CRS completion
> (which this iproc RC incorrectly turns into 0xffff0001 data), so we
> exit the loop after a single msleep(100).
>
> Sinan is extending pci_flr_wait() to read the Vendor ID and look for
> the CRS SV indication.  If this is where you're seeing the NVMe issue,
> I bet we can fix this by simply changing iproc_pcie_config_read() to
> return the correct data when it sees a CRS completion.  Then the
> retry loop in pci_flr_wait() will work as intended.
>

The issue with User Space poll mode drivers resulting into CRS.


root@...958742k:~# /usr/share/spdk/examples/nvme/perf -r 'trtype:PCIe
traddr:0000:01:00.0' -q 128 -s 4096 -w read -d 2048 -t 30 -c 0x1
Starting DPDK 16.11.1 initialization...
[ DPDK EAL parameters: perf -c 1 -m 2048 --file-prefix=spdk_pid2202 ]
EAL: Detected 8 lcore(s)
EAL: Probing VFIO support...
EAL: VFIO support initialized
EAL: cannot open /proc/self/numa_maps, consider that all memory is in
socket_id 0
Initializing NVMe Controllers
EAL: PCI device 0000:01:00.0 on NUMA socket 0
EAL:   probe driver: 8086:953 spdk_nvme
EAL:   using IOMMU type 1 (Type 1)
[   45.811353] vfio_ecap_init: 0000:01:00.0 hiding ecap 0x19@...a0
Attaching to NVMe Controller at 0000:01:00.0 [8086:0953]
[   46.486601] vfio_bar_restore: 0000:01:00.0 reset recovery - restoring bars
nvme_ctrlr.c:1237:nvme_ctrlr_process_init: ***ERROR*** Initialization
timed out in state 3
nvme_ctrlr.c: 414:nvme_ctrlr_shutdown: ***ERROR*** did not shutdown
within 5 seconds
EAL: Hotplug doesn't support vfio yet
spdk_nvme_probe() failed for transport address '0000:01:00.0'
/usr/share/spdk/examples/nvme/perf: errors occured

ok, so this is in the context of
[   52.500579] [<ffff000008088498>] dump_backtrace+0x0/0x208
[   52.506147] [<ffff0000080886b4>] show_stack+0x14/0x20
[   52.511357] [<ffff0000083ace34>] dump_stack+0x8c/0xb0
[   52.516567] [<ffff00000840a550>] pci_flr_wait+0x18/0xc8
[   52.521955] [<ffff00000840bf9c>] pcie_flr+0x34/0x68
[   52.526986] [<ffff00000840e3d0>] __pci_dev_reset+0x100/0x2c0
[   52.532823] [<ffff00000840e60c>] pci_try_reset_function+0x64/0x80
[   52.539108] [<ffff00000856e4b8>] vfio_pci_ioctl+0x398/0xb78
[   52.544857] [<ffff000008568378>] vfio_device_fops_unl_ioctl+0x20/0x30
[   52.551501] [<ffff0000081d4e24>] do_vfs_ioctl+0xa4/0x738
[   52.556980] [<ffff0000081d5544>] SyS_ioctl+0x8c/0xa0
[   52.562100] [<ffff000008082f8c>] __sys_trace_return+0x0/0x4


So, I have two things to do here.

1) remove retry funciton and just do following.
   data = readl(cfg_data_p);
   if (data == CFG_RETRY_STATUS)    /* 0xffff0001 */
     data = 0xffffffff;

2) refactor the code with separate patch.

but for that Sinan's patch is required.
then with both the patches I think, things will work out correctly for
iproc SOC.

Let me know how this sounds and if you agree with above two points, I
will be posting final set of patches.

let me know if you want me to change anything in the commit description.
in my opinion I should remove config write description which is irrelevant here.

Regards,
Oza.

> Sec 2.3.2 says the RC may limit the number of retries, and it doesn't
> say anything about how many retries the RC should do or what interval
> should be between them.  So I think it should be legal to say "this RC
> does zero retries".
>
> I think an RC that does zero retries and doesn't support CRS SV would
> always handle a CRS completion as a "failed transaction" so it would
> synthesize 0xffffffff data (and, I assume, set an error bit like
> Received Master Abort).  If we treated this controller as though it
> doesn't support CRS SV, the workaround in iproc_pcie_config_read()
> would be simply:
>
>   data = readl(cfg_data_p);
>   if (data == CFG_RETRY_STATUS)    /* 0xffff0001 */
>     data = 0xffffffff;
>
> That would make the loop in pci_flr_wait() that reads PCI_COMMAND work
> correctly.  It would get 0xffffffff data as long as the device
> returned CRS completions.
>
> It wouldn't make Sinan's new CRS SV code work, but that's OK: that all
> has to work correctly even on systems that don't support CRS SV.
>
> The only problem is that when we read a non-Vendor ID register that
> happens to really be 0xffff0001, we *should* return 0xffff0001, but we
> incorrectly return 0xffffffff.  But we already know we just have to
> live with that issue.
>
> One minor refactoring comment below.
>
>> iproc PCIe Controller spec:
>> 4.7.3.3. Retry Status On Configuration Cycle
>> Endpoints are allowed to generate retry status on configuration
>> cycles. In this case, the RC needs to re-issue the request. The IP
>> does not handle this because the number of configuration cycles needed
>> will probably be less than the total number of non-posted operations
>> needed.
>>
>> When a retry status is received on the User RX interface for a
>> configuration request that was sent on the User TX interface,
>> it will be indicated with a completion with the CMPL_STATUS field set
>> to 2=CRS, and the user will have to find the address and data values
>> and send a new transaction on the User TX interface.
>> When the internal configuration space returns a retry status during a
>> configuration cycle (user_cscfg = 1) on the Command/Status interface,
>> the pcie_cscrs will assert with the pcie_csack signal to indicate the
>> CRS status.
>> When the CRS Software Visibility Enable register in the Root Control
>> register is enabled, the IP will return the data value to 0x0001 for
>> the Vendor ID value and 0xffff  (all 1’s) for the rest of the data in
>> the request for reads of offset 0 that return with CRS status.  This
>> is true for both the User RX Interface and for the Command/Status
>> interface.  When CRS Software Visibility is enabled, the CMPL_STATUS
>> field of the completion on the User RX Interface will not be 2=CRS and
>> the pcie_cscrs signal will not assert on the Command/Status interface.
>>
>> Per PCIe r3.1, sec 2.3.2, config requests that receive completions
>> with Configuration Request Retry Status (CRS) should be reissued by
>> the hardware except reads of the Vendor ID when CRS Software
>> Visibility is enabled.
>>
>> This hardware never reissues configuration requests when it receives
>> CRS completions.
>> Note that, neither PCIe host bridge nor PCIe core re-issues the
>> request for any configuration offset.
>> As a result of the fact, PCIe RC driver (sw) should take care of
>> retrying.
>>
>> For config writes, there is no way for this driver to learn about
>> CRS completions. A write that receives a CRS completion will not be
>> retried and the data will be discarded. This is a potential data
>> corruption.
>>
>> For config reads, this hardware returns CFG_RETRY_STATUS data when
>> it receives a CRS completion for a config read, regardless of the
>> address of the read or the CRS Software Visibility Enable bit.  As a
>> partial workaround for this, we retry in software any read that
>> returns CFG_RETRY_STATUS.
>>
>> It implements iproc_pcie_config_read which gets called for Stingray.
>> For other iproc based SOC, it falls back to PCI generic APIs.
>>
>> Signed-off-by: Oza Pawandeep <oza.oza@...adcom.com>
>> Reviewed-by: Ray Jui <ray.jui@...adcom.com>
>> Reviewed-by: Scott Branden <scott.branden@...adcom.com>
>>
>> diff --git a/drivers/pci/host/pcie-iproc.c b/drivers/pci/host/pcie-iproc.c
>> index c574863..a3edae4 100644
>> --- a/drivers/pci/host/pcie-iproc.c
>> +++ b/drivers/pci/host/pcie-iproc.c
>> @@ -68,6 +68,9 @@
>>  #define APB_ERR_EN_SHIFT             0
>>  #define APB_ERR_EN                   BIT(APB_ERR_EN_SHIFT)
>>
>> +#define CFG_RETRY_STATUS             0xffff0001
>> +#define CFG_RETRY_STATUS_TIMEOUT_US  500000 /* 500 milli-seconds. */
>> +
>>  /* derive the enum index of the outbound/inbound mapping registers */
>>  #define MAP_REG(base_reg, index)      ((base_reg) + (index) * 2)
>>
>> @@ -448,6 +451,89 @@ static inline void iproc_pcie_apb_err_disable(struct pci_bus *bus,
>>       }
>>  }
>>
>> +static unsigned int iproc_pcie_cfg_retry(void __iomem *cfg_data_p)
>> +{
>> +     int timeout = CFG_RETRY_STATUS_TIMEOUT_US;
>> +     unsigned int data;
>> +
>> +     /*
>> +      * As per PCIe spec r3.1, sec 2.3.2, CRS Software
>> +      * Visibility only affects config read of the Vendor ID.
>> +      * For config write or any other config read the Root must
>> +      * automatically re-issue configuration request again as a
>> +      * new request.
>> +      *
>> +      * For config reads, this hardware returns CFG_RETRY_STATUS data when
>> +      * it receives a CRS completion for a config read, regardless of the
>> +      * address of the read or the CRS Software Visibility Enable bit. As a
>> +      * partial workaround for this, we retry in software any read that
>> +      * returns CFG_RETRY_STATUS.
>> +      */
>> +     data = readl(cfg_data_p);
>> +     while (data == CFG_RETRY_STATUS && timeout--) {
>> +             udelay(1);
>> +             data = readl(cfg_data_p);
>> +     }
>> +
>> +     if (data == CFG_RETRY_STATUS)
>> +             data = 0xffffffff;
>> +
>> +     return data;
>> +}
>> +
>> +static void __iomem *iproc_pcie_map_ep_cfg_reg(struct iproc_pcie *pcie,
>> +                                            unsigned int busno,
>> +                                            unsigned int slot,
>> +                                            unsigned int fn,
>> +                                            int where)
>> +{
>> +     u16 offset;
>> +     u32 val;
>> +
>> +     /* EP device access */
>> +     val = (busno << CFG_ADDR_BUS_NUM_SHIFT) |
>> +             (slot << CFG_ADDR_DEV_NUM_SHIFT) |
>> +             (fn << CFG_ADDR_FUNC_NUM_SHIFT) |
>> +             (where & CFG_ADDR_REG_NUM_MASK) |
>> +             (1 & CFG_ADDR_CFG_TYPE_MASK);
>> +
>> +     iproc_pcie_write_reg(pcie, IPROC_PCIE_CFG_ADDR, val);
>> +     offset = iproc_pcie_reg_offset(pcie, IPROC_PCIE_CFG_DATA);
>> +
>> +     if (iproc_pcie_reg_is_invalid(offset))
>> +             return NULL;
>> +
>> +     return (pcie->base + offset);
>> +}
>> +
>> +static int iproc_pcie_config_read(struct pci_bus *bus, unsigned int devfn,
>> +                                 int where, int size, u32 *val)
>> +{
>> +     struct iproc_pcie *pcie = iproc_data(bus);
>> +     unsigned int slot = PCI_SLOT(devfn);
>> +     unsigned int fn = PCI_FUNC(devfn);
>> +     unsigned int busno = bus->number;
>> +     void __iomem *cfg_data_p;
>> +     u32 data;
>> +
>> +     /* root complex access. */
>> +     if (busno == 0)
>> +             return pci_generic_config_read32(bus, devfn, where, size, val);
>> +
>> +     cfg_data_p = iproc_pcie_map_ep_cfg_reg(pcie, busno, slot, fn, where);
>> +
>> +     if (!cfg_data_p)
>> +             return PCIBIOS_DEVICE_NOT_FOUND;
>> +
>> +     data = iproc_pcie_cfg_retry(cfg_data_p);
>> +
>> +     *val = data;
>> +     if (size <= 2)
>> +             *val = (data >> (8 * (where & 3))) & ((1 << (size * 8)) - 1);
>> +
>> +     return PCIBIOS_SUCCESSFUL;
>> +}
>> +
>>  /**
>>   * Note access to the configuration registers are protected at the higher layer
>>   * by 'pci_lock' in drivers/pci/access.c
>> @@ -459,7 +545,6 @@ static void __iomem *iproc_pcie_map_cfg_bus(struct iproc_pcie *pcie,
>>  {
>>       unsigned slot = PCI_SLOT(devfn);
>>       unsigned fn = PCI_FUNC(devfn);
>> -     u32 val;
>>       u16 offset;
>>
>>       /* root complex access */
>> @@ -484,18 +569,7 @@ static void __iomem *iproc_pcie_map_cfg_bus(struct iproc_pcie *pcie,
>>               if (slot > 0)
>>                       return NULL;
>>
>> -     /* EP device access */
>> -     val = (busno << CFG_ADDR_BUS_NUM_SHIFT) |
>> -             (slot << CFG_ADDR_DEV_NUM_SHIFT) |
>> -             (fn << CFG_ADDR_FUNC_NUM_SHIFT) |
>> -             (where & CFG_ADDR_REG_NUM_MASK) |
>> -             (1 & CFG_ADDR_CFG_TYPE_MASK);
>> -     iproc_pcie_write_reg(pcie, IPROC_PCIE_CFG_ADDR, val);
>> -     offset = iproc_pcie_reg_offset(pcie, IPROC_PCIE_CFG_DATA);
>> -     if (iproc_pcie_reg_is_invalid(offset))
>> -             return NULL;
>> -     else
>> -             return (pcie->base + offset);
>> +     return iproc_pcie_map_ep_cfg_reg(pcie, busno, slot, fn, where);
>
> Thanks for factoring this out.  Can you please split that refactor
> into a separate patch?  That will make this CRS-handling patch smaller
> and simpler.
>
>>  }
>>
>>  static void __iomem *iproc_pcie_bus_map_cfg_bus(struct pci_bus *bus,
>> @@ -554,8 +628,13 @@ static int iproc_pcie_config_read32(struct pci_bus *bus, unsigned int devfn,
>>                                   int where, int size, u32 *val)
>>  {
>>       int ret;
>> +     struct iproc_pcie *pcie = iproc_data(bus);
>>
>>       iproc_pcie_apb_err_disable(bus, true);
>> +     if (pcie->type == IPROC_PCIE_PAXB_V2)
>> +             ret = iproc_pcie_config_read(bus, devfn, where, size, val);
>> +     else
>> +             ret = pci_generic_config_read32(bus, devfn, where, size, val);
>>       ret = pci_generic_config_read32(bus, devfn, where, size, val);
>>       iproc_pcie_apb_err_disable(bus, false);
>>
>> --
>> 1.9.1
>>