Message-ID: <AM0PR04MB4994571246CB1708B79A986594C50@AM0PR04MB4994.eurprd04.prod.outlook.com>
Date: Thu, 8 Nov 2018 20:21:15 +0000
From: Ioana Ciocoi Radulescu <ruxandra.radulescu@....com>
To: David Miller <davem@...emloft.net>
CC: "netdev@...r.kernel.org" <netdev@...r.kernel.org>,
Ioana Ciornei <ioana.ciornei@....com>
Subject: RE: [PATCH net-next] dpaa2-eth: Introduce TX congestion management
> -----Original Message-----
> From: David Miller <davem@...emloft.net>
> Sent: Thursday, November 8, 2018 8:08 AM
> To: Ioana Ciocoi Radulescu <ruxandra.radulescu@....com>
> Cc: netdev@...r.kernel.org; Ioana Ciornei <ioana.ciornei@....com>
> Subject: Re: [PATCH net-next] dpaa2-eth: Introduce TX congestion
> management
>
> From: Ioana Ciocoi Radulescu <ruxandra.radulescu@....com>
> Date: Wed, 7 Nov 2018 10:31:16 +0000
>
> > We chose this mechanism over BQL (to which it is conceptually
> > very similar) because a) we can take advantage of the hardware
> > offloading and b) BQL doesn't match well with our driver fastpath
> > (we process ingress (Rx or Tx conf) frames in batches of up to 16,
> > which in certain scenarios confuses the BQL adaptive algorithm,
> > resulting in too low values of the limit and low performance).
>
> First, this kind of explanation belongs in the commit message.
>
> Second, you'll have to describe better why BQL, which is the
> ultimate standard mechanism for every single driver in the
> kernel to deal with this issue, doesn't work for you.
>
> Are you saying that if 15 TX frames are pending, no TX interrupt
> will arrive at all?
I meant up to 16, not exactly 16.
>
> There absolutely must be some timeout or similar interrupt that gets
> sent in that kind of situation. You cannot leave stale TX packets
> on your ring unprocessed just because a non-multiple of 16 packets
> were queued up and then TX activity stopped.
I'll try to detail a bit how our hardware works, since it doesn't follow the
standard ring-based architecture.
In our driver implementation, we have multiple ingress queues (for Rx frames
and Tx confirmation frames) grouped into core-affine channels. Each channel
may contain one or more queues of each type.
When queues inside a channel have available frames, the hardware triggers
a notification for us; we then issue a dequeue command for that channel,
in response to which hardware will write a number of frames (between 1 and
16) at a memory location of our choice. Frames obtained as a result of one
dequeue operation are all from the same queue, but consecutive dequeues
on the same channel may service different queues.
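To make that concrete, here is a rough sketch of one such channel dequeue,
assuming the DPIO service API (dpaa2_io_service_pull_channel() /
dpaa2_io_store_next()); the surrounding structure and error handling are
simplified for illustration and are not the actual driver code:

#include <soc/fsl/dpaa2-io.h>

/* Simplified sketch, not the real driver code: issue one dequeue
 * command on a channel and walk the 1-16 frames hardware wrote into
 * the store. All frames of one dequeue belong to the same queue.
 */
static int pull_and_consume(struct dpaa2_io *io, u32 ch_id,
                            struct dpaa2_io_store *store)
{
        struct dpaa2_dq *dq;
        int is_last = 0, cleaned = 0;
        int err;

        /* ask hardware to write up to 16 frame descriptors into 'store' */
        err = dpaa2_io_service_pull_channel(io, ch_id, store);
        if (err)
                return err;

        do {
                /* NULL means hardware hasn't finished writing yet; retry */
                dq = dpaa2_io_store_next(store, &is_last);
                if (!dq)
                        continue;

                /* process one Rx frame or Tx confirmation here */
                cleaned++;
        } while (!is_last);

        return cleaned;
}
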
So my initial attempt at implementing BQL called netdev_tx_completed_queue()
for each batch of Tx confirmation frames obtained from a single dequeue
operation. I don't fully understand the BQL algorithm yet, but I think it had
a problem with the fact that we never reported as completed more than
16 frames at a time.
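In code, that first attempt amounted to something like the sketch below. The
helper and the way pkts/bytes are gathered are hypothetical, shown only to
illustrate where the BQL call sat; netdev_tx_completed_queue() is the real API:

#include <linux/netdevice.h>

/* Hypothetical per-batch handler: pkts/bytes are accumulated over the
 * (at most 16) Tx confirmation frames of a single dequeue operation.
 */
static void txconf_batch_done(struct net_device *net_dev, int queue_idx,
                              unsigned int pkts, unsigned int bytes)
{
        struct netdev_queue *nq = netdev_get_tx_queue(net_dev, queue_idx);

        /* BQL never sees more than 16 completed frames per call here,
         * which seems to keep its adaptive limit too low for us
         */
        netdev_tx_completed_queue(nq, pkts, bytes);
}
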
Consider a scenario where a DPAA2 platform does IP forwarding from port1
to port2; to keep things simple, let's say we're single core and have only
one Rx and one Tx confirmation queue per interface.
Without BQL, the egress interface can transmit up to 64 frames at a time
(one for each Rx frame processed by the ingress interface in its NAPI poll)
and then process all 64 confirmations (grouped in 4 dequeues) during its
own NAPI cycle.
With BQL support added, the number of consecutively transmitted frames is
never higher than 33; after that, BQL stops the netdev tx queue until
we mark some of those transmits as completed. This results in a performance
drop of over 30%.
Today I tried to further coalesce the confirmation frames such that I call
netdev_tx_completed_queue() only at the end of the NAPI poll, once for each
confirmation queue that was serviced during that NAPI cycle. I need to do more
testing, but so far it performs *almost* on par with the non-BQL driver
version. But it does complicate the fastpath code and feels somewhat
unnatural.
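Roughly, the coalesced variant keeps per-queue counters across the whole NAPI
poll and reports them to BQL once at the end. The txconf_stats bookkeeping
below is made up for illustration; only the netdev_* calls are the real API:

#include <linux/netdevice.h>

/* Hypothetical counters accumulated for one Tx confirmation queue
 * across all dequeue batches serviced in the current NAPI poll.
 */
struct txconf_stats {
        unsigned int pkts;
        unsigned int bytes;
};

/* Called once at the end of the NAPI poll for each Tx confirmation
 * queue that was serviced during it.
 */
static void txconf_napi_done(struct net_device *net_dev, int queue_idx,
                             struct txconf_stats *st)
{
        struct netdev_queue *nq = netdev_get_tx_queue(net_dev, queue_idx);

        if (!st->pkts)
                return;

        /* one BQL update covering all batches of this NAPI cycle */
        netdev_tx_completed_queue(nq, st->pkts, st->bytes);
        st->pkts = 0;
        st->bytes = 0;
}
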
So I would still prefer using the hardware mechanism, which I think fits
more smoothly with both our hardware and our driver implementation, and adds
less overhead. But if you feel strongly about it I can drop this patch and
polish the BQL support until it looks decent enough (and hopefully doesn't
mess too much with our performance benchmarks).
Thanks,
Ioana