[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <pj41zlk0qod0rh.fsf@u68c7b5b1d2d758.ant.amazon.com>
Date: Wed, 3 Mar 2021 13:29:01 +0200
From: Shay Agroskin <shayagr@...zon.com>
To: Jesper Dangaard Brouer <brouer@...hat.com>
CC: Lorenzo Bianconi <lorenzo.bianconi@...hat.com>,
Lorenzo Bianconi <lorenzo@...nel.org>, <bpf@...r.kernel.org>,
<netdev@...r.kernel.org>, <davem@...emloft.net>, <kuba@...nel.org>,
<ast@...nel.org>, <daniel@...earbox.net>, <toke@...hat.com>,
<freysteinn.alfredsson@....se>, <john.fastabend@...il.com>,
<jasowang@...hat.com>, <mst@...hat.com>,
<thomas.petazzoni@...tlin.com>, <mw@...ihalf.com>,
<linux@...linux.org.uk>, <ilias.apalodimas@...aro.org>,
<netanel@...zon.com>, <akiyano@...zon.com>,
<michael.chan@...adcom.com>, <madalin.bucur@....com>,
<ioana.ciornei@....com>, <jesse.brandeburg@...el.com>,
<anthony.l.nguyen@...el.com>, <saeedm@...dia.com>,
<grygorii.strashko@...com>, <ecree.xilinx@...il.com>
Subject: Re: [PATCH v2 bpf-next] bpf: devmap: move drop error path to devmap
for XDP_REDIRECT
Jesper Dangaard Brouer <brouer@...hat.com> writes:
> On Mon, 1 Mar 2021 13:23:06 +0200
> Shay Agroskin <shayagr@...zon.com> wrote:
>
>> Jesper Dangaard Brouer <brouer@...hat.com> writes:
>>
>> > On Sun, 28 Feb 2021 23:27:25 +0100
>> > Lorenzo Bianconi <lorenzo.bianconi@...hat.com> wrote:
>> >
>> >> > > drops = bq->count - sent;
>> >> > > -out:
>> >> > > - bq->count = 0;
>> >> > > + if (unlikely(drops > 0)) {
>> >> > > + /* If not all frames have been
>> >> > > transmitted, it is our
>> >> > > + * responsibility to free them
>> >> > > + */
>> >> > > + for (i = sent; i < bq->count; i++)
>> >> > > +
>> >> > > xdp_return_frame_rx_napi(bq->q[i]);
>> >> > > + }
>> >> >
>> >> > Wouldn't the logic above be the same even w/o the 'if'
>> >> > condition ?
>> >>
>> >> it is just an optimization to avoid the for loop instruction
>> >> if
>> >> sent = bq->count
>> >
>> > True, and I like this optimization.
>> > It will affect how the code layout is (and thereby I-cache
>> > usage).
>>
>> I'm not sure what I-cache optimization you mean here. Compiling
>> the following C code:
>>
>> # define unlikely(x) __builtin_expect(!!(x), 0)
>>
>> extern void xdp_return_frame_rx_napi(int q);
>>
>> struct bq_stuff {
>> int q[4];
>> int count;
>> };
>>
>> int test(int sent, struct bq_stuff *bq) {
>> int i;
>> int drops;
>>
>> drops = bq->count - sent;
>> if(unlikely(drops > 0))
>> for (i = sent; i < bq->count; i++)
>> xdp_return_frame_rx_napi(bq->q[i]);
>>
>> return 2;
>> }
>>
>> with x86_64 gcc 10.2 with -O3 flag in https://godbolt.org/
>> (which
>> provides the assembly code for different compilers) yields the
>> following assembly:
>>
>> test:
>> mov eax, DWORD PTR [rsi+16]
>> mov edx, eax
>> sub edx, edi
>> test edx, edx
>> jg .L10
>> .L6:
>> mov eax, 2
>> ret
>
> This exactly shows my point. Notice how 'ret' happens earlier
> in this
> function. This is the common case, thus the CPU don't have to
> load the
> asm instruction below.
>
Wasn't aware of that. I'll dig into it
>> .L10:
>> cmp eax, edi
>> jle .L6
>> push rbp
>> mov rbp, rsi
>> push rbx
>> movsx rbx, edi
>> sub rsp, 8
>> .L3:
>> mov edi, DWORD PTR [rbp+0+rbx*4]
>> add rbx, 1
>> call xdp_return_frame_rx_napi
>> cmp DWORD PTR [rbp+16], ebx
>> jg .L3
>> add rsp, 8
>> mov eax, 2
>> pop rbx
>> pop rbp
>> ret
>>
>>
>> When dropping the 'if' completely I get the following assembly
>> output
>> test:
>> cmp edi, DWORD PTR [rsi+16]
>> jge .L6
>
> Jump to .L6 which is the common case. The code in between is
> not used
> in common case, but the CPU will likely load this into I-cache,
> and
> then jumps over the code in common case.
>
>> push rbp
>> mov rbp, rsi
>> push rbx
>> movsx rbx, edi
>> sub rsp, 8
>> .L3:
>> mov edi, DWORD PTR [rbp+0+rbx*4]
>> add rbx, 1
>> call xdp_return_frame_rx_napi
>> cmp DWORD PTR [rbp+16], ebx
>> jg .L3
>> add rsp, 8
>> mov eax, 2
>> pop rbx
>> pop rbp
>> ret
>> .L6:
>> mov eax, 2
>> ret
>>
>> which exits earlier from the function if 'drops > 0' compared
>> to
>> the original code (the 'for' loop looks a little different, but
>> this shouldn't affect icache).
>>
>> When removing the 'if' and surrounding the 'for' condition with
>> 'unlikely' statement:
>>
>> for (i = sent; unlikely(i < bq->count); i++)
>>
>> I get the following assembly code:
>>
>> test:
>> cmp edi, DWORD PTR [rsi+16]
>> jl .L10
>> mov eax, 2
>> ret
>> .L10:
>> push rbx
>> movsx rbx, edi
>> sub rsp, 16
>> .L3:
>> mov edi, DWORD PTR [rsi+rbx*4]
>> mov QWORD PTR [rsp+8], rsi
>> add rbx, 1
>> call xdp_return_frame_rx_napi
>> mov rsi, QWORD PTR [rsp+8]
>> cmp DWORD PTR [rsi+16], ebx
>> jg .L3
>> add rsp, 16
>> mov eax, 2
>> pop rbx
>> ret
>>
>> which is shorter than the other two (one line compared to the
>> second and 7 lines compared the original code) and seems as
>> optimized as the second.
>
> You are also using unlikely() and get the earlier return, with
> less
> instructions, which is great. Perhaps we can use this type of
> unlikely() in the for-statement? WDYT Lorenzo?
>
>
Thank you for this detail explanation (: Learned a lot from it.
I'd rather remove the 'if' if we can use 'for' and 'unlikely'. I
think it looks prettier.
Shay
>> I'm far from being an assembly expert, and I tested a code
>> snippet
>> I wrote myself rather than the kernel's code (for the sake of
>> simplicity only).
>> Can you please elaborate on what makes the original 'if'
>> essential
>> (I took the time to do the assembly tests, please take the time
>> on
>> your side to prove your point, I'm not trying to be grumpy
>> here).
>>
>> Shay
Powered by blists - more mailing lists