Message-ID: <5153369.PcVDn1cGQl@wuerfel>
Date: Thu, 03 Apr 2014 19:57:53 +0200
From: Arnd Bergmann <arnd@...db.de>
To: Russell King - ARM Linux <linux@....linux.org.uk>
Cc: Zhangfei Gao <zhangfei.gao@...aro.org>, davem@...emloft.net,
f.fainelli@...il.com, sergei.shtylyov@...entembedded.com,
mark.rutland@....com, David.Laight@...lab.com,
eric.dumazet@...il.com, linux-arm-kernel@...ts.infradead.org,
netdev@...r.kernel.org, devicetree@...r.kernel.org
Subject: Re: [PATCH 3/3] net: hisilicon: new hip04 ethernet driver

On Thursday 03 April 2014 16:27:46 Russell King - ARM Linux wrote:
> On Wed, Apr 02, 2014 at 11:21:45AM +0200, Arnd Bergmann wrote:
> > - As David Laight pointed out earlier, you must also ensure that
> > you don't have too much /data/ pending in the descriptor ring
> > when you stop the queue. For a 10mbit connection, you have already
> > tested (as we discussed on IRC) that 64 descriptors with 1500 byte
> > frames gives you a 68ms round-trip ping time, which is too much.
> > Conversely, on 1gbit, having only 64 descriptors actually seems
> > a little low, and you may be able to get better throughput if
> > you extend the ring to e.g. 512 descriptors.
>
> You don't manage that by stopping the queue - there's separate interfaces
> where you report how many bytes you've queued (netdev_sent_queue()) and
> how many bytes/packets you've sent (netdev_tx_completed_queue()). This
> allows the netdev schedulers to limit how much data is held in the queue,
> preserving interactivity while allowing the advantages of larger rings.

Ah, I didn't know about these. However, reading through the dql (dynamic
queue limits) code, it seems that this will not work if the tx reclaim is
triggered by a timer, since dql expects to get feedback from the actual
hardware behavior. :(

I guess this is (part of) what David Miller also meant by saying it won't
ever work properly.

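For anyone else following along who had not seen this interface either,
here is a rough userspace model of the accounting, as far as I understand
it. It is only a sketch: every toy_* name is made up for illustration, the
limit is fixed here whereas the real dql code adapts it from completion
feedback, and the stop/restart conditions only loosely follow what
netdev_sent_queue() and netdev_completed_queue() do. The last helper
converts a number of queued bytes into the time the link needs to drain
them, ignoring preamble and inter-frame gap.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy stand-in for per-queue byte accounting; this is NOT struct dql. */
struct toy_bql {
	unsigned int limit;     /* bytes allowed in flight (fixed in this toy) */
	unsigned int in_flight; /* bytes queued to hardware, not yet reclaimed */
};

/*
 * Models what the xmit path reports (cf. netdev_sent_queue()).
 * Returns true when the stack should stop feeding this queue.
 */
static bool toy_bql_sent(struct toy_bql *q, unsigned int bytes)
{
	q->in_flight += bytes;
	return q->in_flight > q->limit;
}

/*
 * Models what tx reclaim reports (cf. netdev_completed_queue()).
 * Returns true when the queue may be restarted.
 */
static bool toy_bql_completed(struct toy_bql *q, unsigned int bytes)
{
	assert(bytes <= q->in_flight);
	q->in_flight -= bytes;
	return q->in_flight <= q->limit;
}

/*
 * Time in microseconds to drain 'bytes' at 'mbit' Mbit/s.
 * 1 Mbit/s is one bit per microsecond, so this is bits / mbit.
 */
static uint64_t toy_drain_us(uint64_t bytes, unsigned int mbit)
{
	assert(mbit > 0);
	return bytes * 8 / mbit;
}
```

With a 3000 byte limit, the third 1500 byte frame makes toy_bql_sent()
return true, and reclaiming one frame makes toy_bql_completed() return
true again. If reclaim only runs from a timer, in_flight drops in bursts
at timer granularity instead of tracking the hardware, which is the
problem I mean above. For the ring sizes discussed, toy_drain_us(64 * 1500,
10) comes out at 76800 us, in the same range as the 68ms ping time measured
on 10mbit.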
> > > + phys = dma_map_single(&ndev->dev, skb->data, skb->len, DMA_TO_DEVICE);
> > > + if (dma_mapping_error(&ndev->dev, phys)) {
> > > + dev_kfree_skb(skb);
> > > + return NETDEV_TX_OK;
> > > + }
> > > +
> > > + priv->tx_skb[tx_head] = skb;
> > > + priv->tx_phys[tx_head] = phys;
> > > + desc->send_addr = cpu_to_be32(phys);
> > > + desc->send_size = cpu_to_be16(skb->len);
> > > + desc->cfg = cpu_to_be32(DESC_DEF_CFG);
> > > + phys = priv->tx_desc_dma + tx_head * sizeof(struct tx_desc);
> > > + desc->wb_addr = cpu_to_be32(phys);
> >
> > One detail: since you don't have cache-coherent DMA, "desc" will
> > reside in uncached memory, so you try to minimize the number of accesses.
> > It's probably faster if you build the descriptor on the stack and
> > then atomically copy it over, rather than assigning each member at
> > a time.
>
> DMA coherent memory is write combining, so multiple writes will be
> coalesced. This also means that barriers may be required to ensure the
> descriptors are pushed out in a timely manner if something like writel()
> is not used in the transmit-triggering path.

Right, makes sense. There is a writel() right after this, so there is no
need for extra barriers. We already concluded that the store operation on
uncached memory isn't actually a problem. Zhangfei Gao also did some
measurements to check the overhead of the one read from uncached memory
that is in the tx path, and that overhead was lost in the noise.

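For the archive, the variant I had suggested (build the descriptor in a
local variable, then copy it into the ring in one go) would look roughly
like the userspace sketch below. Given the write-combining behavior it
should not be needed, so treat it as illustration only. The field names
come from the patch, but the struct layout, the padding field, and the
TOY_DESC_DEF_CFG value are assumptions, and htonl()/htons() stand in for
cpu_to_be32()/cpu_to_be16().

```c
#include <arpa/inet.h> /* htonl/htons stand in for cpu_to_be32/cpu_to_be16 */
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define TOY_DESC_DEF_CFG 0x1u /* placeholder, not the real DESC_DEF_CFG */

/* Hypothetical layout: names from the patch, order and padding assumed. */
struct toy_tx_desc {
	uint32_t send_addr; /* big-endian bus address of the frame */
	uint16_t send_size; /* big-endian frame length */
	uint16_t reserved;  /* assumed padding */
	uint32_t cfg;       /* big-endian descriptor configuration */
	uint32_t wb_addr;   /* big-endian write-back address */
};

/* Fill a ring slot from a descriptor built on the stack. */
static void toy_fill_desc(struct toy_tx_desc *slot, uint32_t buf_phys,
			  uint16_t len, uint32_t desc_phys)
{
	struct toy_tx_desc d;

	assert(slot != NULL);
	memset(&d, 0, sizeof(d));
	d.send_addr = htonl(buf_phys);
	d.send_size = htons(len);
	d.cfg = htonl(TOY_DESC_DEF_CFG);
	d.wb_addr = htonl(desc_phys);

	/* One bulk copy into the slot instead of four separate stores. */
	memcpy(slot, &d, sizeof(d));
}
```

Note that memcpy() is not atomic with respect to the device, so in a real
driver the hardware must still only be told about the descriptor after the
copy, which is what the writel() in the transmit path takes care of.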
Arnd