Message-ID: <20250210222812.1d0479a4@pumpkin>
Date: Mon, 10 Feb 2025 22:28:12 +0000
From: David Laight <david.laight.linux@...il.com>
To: Alexander Lobakin <aleksander.lobakin@...el.com>
Cc: Andrew Lunn <andrew+netdev@...n.ch>, "David S. Miller"
<davem@...emloft.net>, Eric Dumazet <edumazet@...gle.com>, Jakub Kicinski
<kuba@...nel.org>, Paolo Abeni <pabeni@...hat.com>, Alexei Starovoitov
<ast@...nel.org>, Daniel Borkmann <daniel@...earbox.net>, John Fastabend
<john.fastabend@...il.com>, Andrii Nakryiko <andrii@...nel.org>, "Jose E.
Marchesi" <jose.marchesi@...cle.com>, Toke Høiland-Jørgensen <toke@...hat.com>, Magnus Karlsson
<magnus.karlsson@...el.com>, Maciej Fijalkowski
<maciej.fijalkowski@...el.com>, Przemek Kitszel
<przemyslaw.kitszel@...el.com>, Jason Baron <jbaron@...mai.com>, Casey
Schaufler <casey@...aufler-ca.com>, Nathan Chancellor <nathan@...nel.org>,
bpf@...r.kernel.org, netdev@...r.kernel.org, linux-kernel@...r.kernel.org
Subject: Re: [PATCH net-next 2/4] i40e: use generic unrolled_count() macro
On Thu, 6 Feb 2025 19:26:27 +0100
Alexander Lobakin <aleksander.lobakin@...el.com> wrote:
> i40e, as well as ice, has a custom loop unrolling macro for unrolling
> Tx descriptors filling on XSk xmit.
> Replace i40e defs with generic unrolled_count(), which is also more
> convenient as it allows passing defines as its argument, not hardcoded
> values, while the loop declaration will still be a usual for-loop.
..
> #define PKTS_PER_BATCH 4
>
> -#ifdef __clang__
> -#define loop_unrolled_for _Pragma("clang loop unroll_count(4)") for
> -#elif __GNUC__ >= 8
> -#define loop_unrolled_for _Pragma("GCC unroll 4") for
> -#else
> -#define loop_unrolled_for for
> -#endif
...
> @@ -529,7 +530,8 @@ static void i40e_xmit_pkt_batch(struct i40e_ring *xdp_ring, struct xdp_desc *des
> dma_addr_t dma;
> u32 i;
>
> - loop_unrolled_for(i = 0; i < PKTS_PER_BATCH; i++) {
> + unrolled_count(PKTS_PER_BATCH)
> + for (i = 0; i < PKTS_PER_BATCH; i++) {
> u32 cmd = I40E_TX_DESC_CMD_ICRC | xsk_is_eop_desc(&desc[i]);
>
> dma = xsk_buff_raw_get_dma(xdp_ring->xsk_pool, desc[i].addr);
The rest of that code is:
		tx_desc = I40E_TX_DESC(xdp_ring, ntu++);
		tx_desc->buffer_addr = cpu_to_le64(dma);
		tx_desc->cmd_type_offset_bsz = build_ctob(cmd, 0, desc[i].len, 0);
		*total_bytes += desc[i].len;
	}

	xdp_ring->next_to_use = ntu;
}

static void i40e_fill_tx_hw_ring(struct i40e_ring *xdp_ring, struct xdp_desc *descs, u32 nb_pkts,
				 unsigned int *total_bytes)
{
	u32 batched, leftover, i;

	batched = nb_pkts & ~(PKTS_PER_BATCH - 1);
	leftover = nb_pkts & (PKTS_PER_BATCH - 1);
	for (i = 0; i < batched; i += PKTS_PER_BATCH)
		i40e_xmit_pkt_batch(xdp_ring, &descs[i], total_bytes);
	for (i = batched; i < batched + leftover; i++)
		i40e_xmit_pkt(xdp_ring, &descs[i], total_bytes);
}
If it isn't a silly question, why all the faffing with unrolling?
It isn't as though the loop body is trivial - it contains real function calls.
Unrolling loops is so 1980s - unless you are trying to get the absolute
max performance from a very short loop and need to unroll once (maybe twice)
to get enough spare instruction execution slots to run the loop control
code in parallel with the body.
In this case it looks like the 'batched' loop contains an inlined copy of
the function called for the remainder.
I can't see anything else.
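To make that concrete, here is a rough userspace sketch (not the driver code) of the batched-plus-leftover shape next to a single plain loop. All demo_* names are made up for illustration, and the per-packet body is reduced to byte accounting, which is an assumption about what matters here.

```c
#include <stdint.h>

#define PKTS_PER_BATCH 4

/* Hypothetical stand-in for struct xdp_desc; only the length is modelled. */
struct demo_desc {
	uint32_t len;
};

/* Hypothetical stand-in for i40e_xmit_pkt(): account for one descriptor. */
static void demo_xmit_pkt(const struct demo_desc *desc,
			  unsigned int *total_bytes)
{
	*total_bytes += desc->len;
}

/* Stand-in for i40e_xmit_pkt_batch(): same body, PKTS_PER_BATCH times. */
static void demo_xmit_pkt_batch(const struct demo_desc *desc,
				unsigned int *total_bytes)
{
	uint32_t i;

	for (i = 0; i < PKTS_PER_BATCH; i++)
		demo_xmit_pkt(&desc[i], total_bytes);
}

/* Shape of the existing code: full batches first, then the remainder. */
static void demo_fill_split(const struct demo_desc *descs, uint32_t nb_pkts,
			    unsigned int *total_bytes)
{
	uint32_t batched = nb_pkts & ~(uint32_t)(PKTS_PER_BATCH - 1);
	uint32_t i;

	for (i = 0; i < batched; i += PKTS_PER_BATCH)
		demo_xmit_pkt_batch(&descs[i], total_bytes);
	for (i = batched; i < nb_pkts; i++)
		demo_xmit_pkt(&descs[i], total_bytes);
}

/* One plain loop doing the same work, with no batching at all. */
static void demo_fill_simple(const struct demo_desc *descs, uint32_t nb_pkts,
			     unsigned int *total_bytes)
{
	uint32_t i;

	for (i = 0; i < nb_pkts; i++)
		demo_xmit_pkt(&descs[i], total_bytes);
}
```

Both fill functions visit the same descriptors in the same order, so any gain from the split has to come from the unrolled body itself.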
You'd probably gain more by getting rid of the 'unsigned int *total_bytes'
parameter and using the function return value - that is what it is for.
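Something along these lines, as an untested userspace sketch rather than a patch. The sketch_* names and the cut-down descriptor struct are invented for the example, and the descriptor fill itself is left out.

```c
#include <stdint.h>

/* Hypothetical stand-in for struct xdp_desc; only the length is modelled. */
struct sketch_desc {
	uint32_t len;
};

/*
 * Returns the number of bytes queued for this packet instead of updating
 * a caller-supplied pointer. A real driver would fill the Tx descriptor here.
 */
static unsigned int sketch_xmit_pkt(const struct sketch_desc *desc)
{
	return desc->len;
}

/* The caller accumulates in a local and hands the sum back the same way. */
static unsigned int sketch_fill_tx_hw_ring(const struct sketch_desc *descs,
					   uint32_t nb_pkts)
{
	unsigned int total_bytes = 0;
	uint32_t i;

	for (i = 0; i < nb_pkts; i++)
		total_bytes += sketch_xmit_pkt(&descs[i]);

	return total_bytes;
}
```

The running total then lives in a local the compiler can keep in a register, rather than behind a pointer it may have to assume can alias other stores.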
David