[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAMaF-rN3tV4ouS-=LHbP2k00PdNReTG0wuOwZJYKhGD1+vSE2A@mail.gmail.com>
Date: Tue, 13 Sep 2011 15:55:52 -0500
From: Jon Mason <mason@...i.com>
To: Benjamin Herrenschmidt <benh@...nel.crashing.org>
Cc: Linus Torvalds <torvalds@...ux-foundation.org>,
linux-kernel@...r.kernel.org, linux-pci@...r.kernel.org,
Jesse Barnes <jbarnes@...tuousgeek.org>
Subject: Re: [PATCH 0/2] PCI: PCI-E MPS bug fixes
On Tue, Sep 13, 2011 at 5:50 AM, Benjamin Herrenschmidt
<benh@...nel.crashing.org> wrote:
>
>>
>> You should apply them. I would queue them, and but I don't have
>> another PCI tree anywhere for you to pull them.
>>
>> In fact, I'll be out for the next few weeks; I'll ask Bjorn to cover
>> for me.
>
> Note that I might be missing some patches, but what's currently in Linus
> tree is rather wrong.
It is worth noting that with my last patch, this is not the default
behavior. Whether it is faulty is open for debate.
>
> The comment is right, but the implementation is bogus:
>
> static void pcie_write_mps(struct pci_dev *dev, int mps)
> {
> int rc, dev_mpss;
>
> dev_mpss = 128 << dev->pcie_mpss;
>
> if (pcie_bus_config == PCIE_BUS_PERFORMANCE) {
> if (dev->bus->self) {
> dev_dbg(&dev->bus->dev, "Bus MPSS %d\n",
> 128 << dev->bus->self->pcie_mpss);
>
> /* For "MPS Force Max", the assumption is made that
> * downstream communication will never be larger than
> * the MRRS. So, the MPS only needs to be configured
> * for the upstream communication. This being the case,
> * walk from the top down and set the MPS of the child
> * to that of the parent bus.
> */
> mps = 128 << dev->bus->self->pcie_mpss;
>
> So here, we decide to use as an mps, the mpss of the bridge. That's already not
> quite right. The bridge might itself be clamped down (because it's parent has
> a smaller MPSS, well, your other bugs below will prevent that but that's how
> it should be)
This is started from the root complex. So, it should be able to
handle the maximum setting of the bridge chip.
>
> So we should use the MPS of the bridge not the MPSS (ie the configured MPS of
> the bridge, not the max MPS of the bridge).
>
> That does mean that we do need to configure the bridge first before we configure
> the bridge children, I don't remember if that's what happens with the walk function
> so we need to dbl check (I have to run, no time to check right now)
Yes, that is the behavior.
>
> if (mps > dev_mpss)
> dev_warn(&dev->dev, "MPS configured higher than"
> " maximum supported by the device. If"
> " a bus issue occurs, try running with"
> " pci=pcie_bus_safe.\n");
>
> Yeah right. Instead of warning, we should actually -clamp- it. IE If the
> bridge can do 4096 bytes and the device can only do 128 bytes, it's crazy
> to try to set the device to 4096 bytes.
You are correct. I was looking at this code when the bugs started
coming in and realized this was erroneous. A simple mps=min(mps,
dev_mpss) should fix it.
> The idea is that we set the device to the higher it can up to the size
> supported by the brigde, not the exact size of the brigde regardless of
> whether that's bigger than what the device can handle.
>
> It -never- makes any sense to configure a device to an MPS larger than
> it's own maxium.
>
> }
>
> dev->pcie_mpss = ffs(mps) - 8;
>
> And here I'm baffled. Why on earth would we -ever- want to change dev->pcie_mpss ?
I was being overly clever here and using it as a way to actual MPS of
the bridges. A comment here would be helpful, but in hindsight I
think it is a bad idea to do it this way.
>
> This is the the HARD maximum supported by the device, as read from the capability
> register. Changing our idea of what the device supports as a maximum isn't going
> to help us... all it will do is allow the subsequent:
>
> }
>
> rc = pcie_set_mps(dev, mps);
> if (rc)
> dev_err(&dev->dev, "Failed attempting to set the MPS\n");
>
> To actually try to program the bogus mps value we just calculated (ie the device
> larger than what it supports).
>
> }
>
> The right thing to do here for the "performance" case is to use the min() of
> the bus mps and the dev_mpss basically.
>
> Ie. Going top down:
>
> - If !parent, mps = mpss else mps = min(mpss, parent->mps)
> - Iterate children
>
> That's it.
>
> So of course, the current implementation -will- break things. It might even explain
> some of the radeon problems you've been observing, ie, might have nothing to do with
> MRRS...
My patch before that removed MRRS setting from the "performance"
option solved a majority of the problems seen, but as you said in
response to that patch, it was buggy.
>
> As to whether we should default to "safe" or "performance", well, it really depends
> on the platform.
>
> I know for a fact on power that my host bridges are never going to generate a downstream
> request with a TLP larger than 128 bytes, and we never do PCIe <-> PCIe transfers,
> so "performance" is good for me.
>
> x86 with funky DMA engines in the CPU etc... might be capable of generating larger
> TLPs, in which case it must switch to "safe".
Only is it is not observing the MRRS value.
>
> Note that this whole business is really a mis-design of PCIe. The 'performance' mode is
> a way to try to work around it and not end up with everything clamped down to the minimum
> MPS (128 bytes) all the time, but it's a bit risky. Ideally the spec should have separated
> at least the MPS of packets generated vs. the MPS of packets received at the root
> complex level.
And a fix for poorly written BIOS (the original cause of this patch).
>
> Now, I am giving some courses/training today and may not have time to propose a patch,
> but if you don't get to it by tomorrow I'll see what I can do.
I'll make a pass at this shortly and send it to linux-pci. Thanks for
going over the code.
>
> Cheers,
> Ben.
>
>
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Powered by blists - more mailing lists