[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <9baecde9-d92b-c18c-daa8-e7a96baa019b@nbd.name>
Date: Wed, 25 Jan 2023 18:32:48 +0100
From: Felix Fietkau <nbd@....name>
To: Alexander H Duyck <alexander.duyck@...il.com>,
Ilias Apalodimas <ilias.apalodimas@...aro.org>
Cc: netdev@...r.kernel.org, Jesper Dangaard Brouer <hawk@...nel.org>,
"David S. Miller" <davem@...emloft.net>,
Eric Dumazet <edumazet@...gle.com>,
Jakub Kicinski <kuba@...nel.org>,
Paolo Abeni <pabeni@...hat.com>,
Lorenzo Bianconi <lorenzo@...nel.org>,
linux-kernel@...r.kernel.org, Yunsheng Lin <linyunsheng@...wei.com>
Subject: Re: [PATCH] net: page_pool: fix refcounting issues with fragmented
allocation
On 25.01.23 18:11, Alexander H Duyck wrote:
> On Tue, 2023-01-24 at 22:30 +0100, Felix Fietkau wrote:
>> On 24.01.23 22:10, Alexander H Duyck wrote:
>> > On Tue, 2023-01-24 at 18:22 +0100, Felix Fietkau wrote:
>> > > On 24.01.23 15:11, Ilias Apalodimas wrote:
>> > > > Hi Felix,
>> > > >
>> > > > ++cc Alexander and Yunsheng.
>> > > >
>> > > > Thanks for the report
>> > > >
>> > > > On Tue, 24 Jan 2023 at 14:43, Felix Fietkau <nbd@....name> wrote:
>> > > > >
>> > > > > While testing fragmented page_pool allocation in the mt76 driver, I was able
>> > > > > to reliably trigger page refcount underflow issues, which did not occur with
>> > > > > full-page page_pool allocation.
>> > > > > It appears to me, that handling refcounting in two separate counters
>> > > > > (page->pp_frag_count and page refcount) is racy when page refcount gets
>> > > > > incremented by code dealing with skb fragments directly, and
>> > > > > page_pool_return_skb_page is called multiple times for the same fragment.
>> > > > >
>> > > > > Dropping page->pp_frag_count and relying entirely on the page refcount makes
>> > > > > these underflow issues and crashes go away.
>> > > > >
>> > > >
>> > > > This has been discussed here [1]. TL;DR changing this to page
>> > > > refcount might blow up in other colorful ways. Can we look closer and
>> > > > figure out why the underflow happens?
>> > > I don't see how the approch taken in my patch would blow up. From what I
>> > > can tell, it should be fairly close to how refcount is handled in
>> > > page_frag_alloc. The main improvement it adds is to prevent it from
>> > > blowing up if pool-allocated fragments get shared across multiple skbs
>> > > with corresponding get_page and page_pool_return_skb_page calls.
>> > >
>> > > - Felix
>> > >
>> >
>> > Do you have the patch available to review as an RFC? From what I am
>> > seeing it looks like you are underrunning on the pp_frag_count itself.
>> > I would suspect the issue to be something like starting with a bad
>> > count in terms of the total number of references, or deducing the wrong
>> > amount when you finally free the page assuming you are tracking your
>> > frag count using a non-atomic value in the driver.
>> The driver patches for page pool are here:
>> https://patchwork.kernel.org/project/linux-wireless/patch/64abb23f4867c075c19d704beaae5a0a2f8e8821.1673963374.git.lorenzo@kernel.org/
>> https://patchwork.kernel.org/project/linux-wireless/patch/68081e02cbe2afa2d35c8aa93194f0adddbd0f05.1673963374.git.lorenzo@kernel.org/
>>
>> They are also applied in my mt76 tree at:
>> https://github.com/nbd168/wireless
>>
>> - Felix
>
> So one thing I am thinking is that we may be seeing an issue where we
> are somehow getting a mix of frag and non-frag based page pool pages.
> That is the only case I can think of where we might be underflowing
> negative. If you could add some additional debug info on the underflow
> WARN_ON case in page_pool_defrag_page that might be useful.
> Specifically I would be curious what the actual return value is. I'm
> assuming we are only hitting negative 1, but I would want to verify we
> aren't seeing something else.
I'll try to run some more tests soon. However, I think I found the piece
of code that is incompatible with using pp_frag_count.
When receiving an A-MSDU packet (multiple MSDUs within a single 802.11
packet), and it is not split by the hardware, a cfg80211 function
extracts the individual MSDUs into separate skbs. In that case, a
fragment can be shared across multiple skbs, and get_page is used to
increase the refcount.
You can find this in net/wireless/util.c: ieee80211_amsdu_to_8023s (and
its helper functions).
This code also has a bug where it doesn't set pp_recycle on the newly
allocated skb if the previous one has it, but that's a separate matter
and fixing it doesn't make the crash go away.
Is there any way I can make that part of the code work with the current
page pool frag implementation?
> Also just to confirm this is building on 64b kernel correct? Just want
> to make sure we don't have this running on a 32b setup where the frag
> count and the upper 32b of the DMA address are overlapped.
Yes, I'm using a 64b kernel.
> As far as the patch set I only really see a few minor issues which I am
> going to post a few snippets below.
>
>
>> diff --git a/drivers/net/wireless/mediatek/mt76/dma.c
>> b/drivers/net/wireless/mediatek/mt76/dma.c
>> index 611769e445fa..7fd9aa9c3d9e 100644
>
> ...
>
>> @@ -593,25 +593,28 @@ mt76_dma_rx_fill(struct mt76_dev *dev, struct
>> mt76_queue *q)
>>
>> while (q->queued < q->ndesc - 1) {
>> struct mt76_queue_buf qbuf;
>> - void *buf = NULL;
>> + dma_addr_t addr;
>> + int offset;
>> + void *buf;
>>
>> - buf = page_frag_alloc(&q->rx_page, q->buf_size,
>> GFP_ATOMIC);
>> + buf = mt76_get_page_pool_buf(q, &offset, q-
>> >buf_size);
>> if (!buf)
>> break;
>>
>> - addr = dma_map_single(dev->dma_dev, buf, len,
>> DMA_FROM_DEVICE);
>> + addr = dma_map_single(dev->dma_dev, buf + offset,
>> len,
>> + DMA_FROM_DEVICE);
>
> Offset was already added to buf in mt76_get_page_pool_buf so the DMA
> mapping offset doesn't look right to me.
Right. This is resolved by the follow-up patch which keeps pages DMA
mapped. I plan on squashing both patches into one and adding some fixes
on top when the underlying page pool issue is resolved.
>> if (unlikely(dma_mapping_error(dev->dma_dev, addr)))
>> {
>> - skb_free_frag(buf);
>> + mt76_put_page_pool_buf(buf, allow_direct);
>> break;
>> }
>>
>
> I'm not a fan of the defensive programming in mt76_put_page_pool_buf.
> If you are in an area that is using page pool you should be using the
> page pool version of the freeing operations instead of adding
> additional overhead that can mess things up by having it have to also
> check for if the page is a page pool page or not.
See below.
>> - qbuf.addr = addr + offset;
>> - qbuf.len = len - offset;
>> + qbuf.addr = addr + q->buf_offset;
>> + qbuf.len = len - q->buf_offset;
>> qbuf.skip_unmap = false;
>> if (mt76_dma_add_rx_buf(dev, q, &qbuf, buf) < 0) {
>> dma_unmap_single(dev->dma_dev, addr, len,
>> DMA_FROM_DEVICE);
>> - skb_free_frag(buf);
>> + mt76_put_page_pool_buf(buf, allow_direct);
>> break;
>> }
>> frames++;
>
> ...
>
>> @@ -848,6 +847,8 @@ mt76_dma_rx_process(struct mt76_dev *dev, struct
>> mt76_queue *q, int budget)
>> goto free_frag;
>>
>> skb_reserve(skb, q->buf_offset);
>> + if (mt76_is_page_from_pp(data))
>> + skb_mark_for_recycle(skb);
>>
>> *(u32 *)skb->cb = info;
>>
>
> More defensive programming here. Is there a path that allows for a
> mixed setup?
>
> The only spot where I can see there being anything like that is in
> /drivers/net/wireless/mediatek/mt76/mt7915/mmio.c. But it doesn't make
> any sense to me as to why it was included in the patch. It might be
> easier to sort out the issue if we were to get rid of some of the
> defensive programming.
This is not defensive programming. In its current state, there is a
scenario where we can have a mix of pp and non-pp pages (if hardware
offload support is enabled).
However in my tests, offload support was disabled and all pages are PP
ones.
I also have some unpublished pending changes to always allocate from the
pool (even for the initial buffers allocated for offloading).
This did not make a difference in my tests though.
- Felix
Powered by blists - more mailing lists