Message-ID: <CANn89iKvA6FBbN2o8dnisd_VFrFfm+WFqMFX50gdQCpRTdpX1Q@mail.gmail.com>
Date: Tue, 7 Feb 2017 11:18:27 -0800
From: Eric Dumazet <edumazet@...gle.com>
To: Alexei Starovoitov <alexei.starovoitov@...il.com>
Cc: Eric Dumazet <eric.dumazet@...il.com>,
Tariq Toukan <ttoukan.linux@...il.com>,
"David S . Miller" <davem@...emloft.net>,
netdev <netdev@...r.kernel.org>,
Tariq Toukan <tariqt@...lanox.com>,
Martin KaFai Lau <kafai@...com>,
Willem de Bruijn <willemb@...gle.com>,
Jesper Dangaard Brouer <brouer@...hat.com>,
Brenden Blanco <bblanco@...mgrid.com>,
Alexei Starovoitov <ast@...nel.org>
Subject: Re: [PATCH net-next 0/9] mlx4: order-0 allocations and page recycling
On Tue, Feb 7, 2017 at 11:05 AM, Alexei Starovoitov
<alexei.starovoitov@...il.com> wrote:
> On Tue, Feb 07, 2017 at 08:26:23AM -0800, Eric Dumazet wrote:
>> On Tue, 2017-02-07 at 08:06 -0800, Eric Dumazet wrote:
>>
>
> Awesome that you've started working on this. I think it's correct approach
> and mlx5 should be cleaned up in similar way.
> Long term we should be able to move all page alloc/free out of the drivers
> completely.
>
>> > /*
>> > * make sure we read the CQE after we read the ownership bit
>> > */
>> > dma_rmb();
>> > + prefetch(frags[0].page);
>>
>> Note that I would like to instead do a prefetch(frags[1].page)
>
> yeah, these two look weird:
> + prefetch(frags[0].page);
> + va = page_address(frags[0].page) + frags[0].page_offset;
>
> on most archs page_address() is just math (not a load from memory),
> but the result != frags[0].page, so I'm missing what you are trying to prefetch.
>
> prefetch(frags[1].page)
> is even more confusing. what will it prefetch?
The "struct page" of the following frame.
Remember we need:
release = page_count(page) != 1 ||
page_is_pfmemalloc(page) ||
page_to_nid(page) != numa_mem_id();
Then:
page_ref_inc(page);
My patch now does:
prefetch(frags[priv->num_frags].page);
>
> btw we had a patch that was doing prefetch of 'va' of next packet
> and it was very helpful. Like this:
I preferred to fetch the second cache line of this frame,
because TCP is mostly used with timestamps: a total of 66 bytes of
headers with IPv4, and more for IPv6 of course.
> pref_index = (index + 1) & ring->size_mask;
> pref = ring->rx_info + (pref_index << priv->log_rx_info);
> prefetch(page_address(pref->page) + pref->page_offset);
>
> but since you're redesigning ring->rx_info... not sure how it will fit.
>
>> So I will probably change how ring->rx_info is allocated
>>
>> wasting all that space and forcing vmalloc() is silly:
>>
>> tmp = size * roundup_pow_of_two(MLX4_EN_MAX_RX_FRAGS *
>> sizeof(struct mlx4_en_rx_alloc));
>
> I think you'd still need roundup_pow_of_two otherwise priv->log_rx_info
> optimization won't work.
No more log_rx_info trick.
Simply: frags = priv->rx_info + (index * priv->rx_info_bytes_per_slot);
A multiply is damn fast these days compared to cache misses.
Using 24*<rx_ring_size> bytes is better than 32*<rx_ring_size>, since our
L1/L2 caches are quite small.
Of course, this applies to the 'stress' mode, not the light mode where
we receive one single packet per IRQ.