Message-ID: <CAKgT0Ufsk8Uf--nfHwBmHKqNWgDpckVRT5pT47P4mOuOe-Qj4w@mail.gmail.com>
Date: Thu, 17 Nov 2016 14:39:08 -0800
From: Alexander Duyck <alexander.duyck@...il.com>
To: David Laight <David.Laight@...lab.com>
Cc: Jesper Dangaard Brouer <brouer@...hat.com>,
Eric Dumazet <eric.dumazet@...il.com>,
Rick Jones <rick.jones2@....com>,
"netdev@...r.kernel.org" <netdev@...r.kernel.org>
Subject: Re: Netperf UDP issue with connected sockets
On Thu, Nov 17, 2016 at 9:34 AM, David Laight <David.Laight@...lab.com> wrote:
> From: Jesper Dangaard Brouer
>> Sent: 17 November 2016 14:58
>> On Thu, 17 Nov 2016 06:17:38 -0800
>> Eric Dumazet <eric.dumazet@...il.com> wrote:
>>
>> > On Thu, 2016-11-17 at 14:42 +0100, Jesper Dangaard Brouer wrote:
>> >
>> > > I can see that qdisc layer does not activate xmit_more in this case.
>> > >
>> >
>> > Sure. Not enough pressure from the sender(s).
>> >
>> > The bottleneck is not the NIC or qdisc in your case, meaning that BQL
>> > limit is kept at a small value.
>> >
>> > (BTW not all NIC have expensive doorbells)
>>
>> I believe this NIC mlx5 (50G edition) does.
>>
>> I'm seeing UDP TX of 1656017.55 pps, which is per packet:
>> 2414 cycles(tsc) 603.86 ns
>>
>> Perf top shows (with my own udp_flood, that avoids __ip_select_ident):
>>
>> Samples: 56K of event 'cycles', Event count (approx.): 51613832267
>> Overhead Command Shared Object Symbol
>> + 8.92% udp_flood [kernel.vmlinux] [k] _raw_spin_lock
>> - _raw_spin_lock
>> + 90.78% __dev_queue_xmit
>> + 7.83% dev_queue_xmit
>> + 1.30% ___slab_alloc
>> + 5.59% udp_flood [kernel.vmlinux] [k] skb_set_owner_w
>> + 4.77% udp_flood [mlx5_core] [k] mlx5e_sq_xmit
>> + 4.09% udp_flood [kernel.vmlinux] [k] fib_table_lookup
>> + 4.00% swapper [mlx5_core] [k] mlx5e_poll_tx_cq
>> + 3.11% udp_flood [kernel.vmlinux] [k] __ip_route_output_key_hash
>> + 2.49% swapper [kernel.vmlinux] [k] __slab_free
>>
>> In this setup the spinlock in __dev_queue_xmit should be uncongested.
>> An uncongested spin_lock+unlock costs 32 cycles(tsc) 8.198 ns on this system.
>>
>> But 8.92% of the time is spent on it, which corresponds to a cost of 215
>> cycles (2414*0.0892). This cost is too high, so something else is
>> going on... I claim this mysterious extra cost is the tailptr/doorbell.
>
> Try adding code to ring the doorbell twice.
> If this doesn't slow things down then it isn't (likely to be) responsible
> for the delay you are seeing.
>
> David
>
The problem isn't only the doorbell. It is that the doorbell write plus
a locked transaction on x86 results in a long wait until the doorbell
write has completed.
You could batch a bunch of doorbell writes together and it wouldn't be
an issue. But if you do something like writel(), wmb(), writel(), wmb(),
you will see the effect double, since the write memory barrier is what
forces the delays.
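
Roughly, the difference looks like this. This is just a sketch, not the
actual mlx5e code; fake_sq, post_descriptor() and the doorbell field are
made-up names standing in for the driver's real send-queue state and
descriptor write:

	#include <linux/io.h>		/* writel() */
	#include <linux/skbuff.h>	/* struct sk_buff */

	/* Made-up stand-in for the driver's send-queue state. */
	struct fake_sq {
		u16 pc;			/* producer counter */
		void __iomem *doorbell;	/* mapped doorbell register */
	};

	/* Hypothetical helper: writes one TX descriptor into the ring. */
	static void post_descriptor(struct fake_sq *sq, struct sk_buff *skb);

	/* Pattern A: barrier + doorbell per packet (the expensive case). */
	static void xmit_one(struct fake_sq *sq, struct sk_buff *skb)
	{
		post_descriptor(sq, skb);
		wmb();				/* descriptor visible before MMIO */
		writel(sq->pc, sq->doorbell);	/* uncached write, CPU stalls here */
	}

	/* Pattern B: queue several packets, one barrier + one doorbell at the end. */
	static void xmit_burst(struct fake_sq *sq, struct sk_buff **skbs, int n)
	{
		int i;

		for (i = 0; i < n; i++)
			post_descriptor(sq, skbs[i]);
		wmb();
		writel(sq->pc, sq->doorbell);
	}

Pattern B is roughly what xmit_more is meant to give you: skip the
wmb()/writel() pair until the last packet of the burst, so the flush
cost is paid once instead of per packet.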
- Alex