Message-ID: <0BA3FCBA62E2DC44AF3030971E174FB32EA05D67@hasmsx107.ger.corp.intel.com>
Date: Sun, 7 Feb 2016 12:39:18 +0000
From: "Grumbach, Emmanuel" <emmanuel.grumbach@...el.com>
To: Dave Taht <dave.taht@...il.com>
CC: linux-wireless <linux-wireless@...r.kernel.org>,
"netdev@...r.kernel.org" <netdev@...r.kernel.org>
Subject: Re: [RFC RESEND] iwlwifi: pcie: transmit queue auto-sizing
On 02/05/2016 12:06 AM, Dave Taht wrote:
> I am not on linux-wireless nor netdev presently, but...
>
> On Thu, Feb 4, 2016 at 12:16 PM, Emmanuel Grumbach
> <emmanuel.grumbach@...el.com> wrote:
>> Like many (all?) WiFi devices, Intel WiFi devices have
>> transmit queues which have 256 transmit descriptors
>> each and each descriptor corresponds to an MPDU.
>> This means that when it is full, the queue contains
>> 256 * ~1500 bytes to be transmitted (if we don't have
>> A-MSDUs). The purpose of those queues is to have enough
>> packets to be ready for transmission so that when the device
>> gets an opportunity to transmit (TxOP), it can take as many
>> packets as the spec allows and aggregate them into one
>> A-MPDU or even several A-MPDUs if we are using bursts.
>>
>> The problem is that the packets that are in these queues are
>> already out of control of the Qdisc and can stay in those
>> queues for a fairly long time when the link condition is
>> not good. This leads to the well known bufferbloat problem.
>>
>> This patch adds a way to tune the size of the transmit queue
>> so that it won't cause excessive latency. When the link
>> condition is good, the packets will flow smoothly and the
>> transmit queue will grow quickly allowing A-MPDUs and
>> maximal throughput. When the link is not optimal, we will
>> have retransmissions, lower transmit rates or signal
>> detection (CCA) which will cause a delay in the packet
>> transmission. The driver will sense this higher latency
>> and will reduce the size of the transmit queue.
>> This means that the packets that continue to arrive
>> will pile up in the Qdisc rather than in the device
>> queues. The major advantage of this approach is that
>> codel can now do its work.
>>
>> The algorithm is really (too?) simple:
>> every 5 seconds, it starts from a short queue again.
>> If a packet has been in the queue for less than 10ms,
>> allow 10 more MPDUs in.
>> If a packet has been in the queue for more than 20ms,
>> reduce by 10 the size of the transmit queue.
>>
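[Editor's sketch: the policy above can be modeled in a few lines of plain C. This is a user-space illustration only, not the driver code; the names (txq_model, txq_adjust, etc.) are invented for this sketch, and the jiffies plumbing from the patch is replaced with a plain millisecond clock. Note that min_space is a threshold on *free* queue space, so lowering it lets the queue grow.]

```c
/* Hypothetical user-space model of the patch's auto-sizing policy.
 * Constants mirror the patch; all names here are illustrative. */

#define TXQ_MAX_SPACE   252  /* upper bound used by the patch */
#define TXQ_START_SPACE 240  /* short queue we reset to every 5 s */

struct txq_model {
	int min_space;    /* current free-space threshold */
	long reset_ts_ms; /* time of the last periodic reset */
	int high_mark;    /* lower bound for min_space */
};

/* Called for every reclaimed frame with the time it spent queued. */
static void txq_adjust(struct txq_model *q, long queued_ms)
{
	if (queued_ms < 10) {
		q->min_space -= 10;          /* link is fast: allow 10 more MPDUs */
		if (q->min_space < q->high_mark)
			q->min_space = q->high_mark;
	} else if (queued_ms > 20) {
		q->min_space += 10;          /* link is slow: shrink the queue */
		if (q->min_space > TXQ_MAX_SPACE)
			q->min_space = TXQ_MAX_SPACE;
	}
}

/* Called on transmit; starts from a short queue again every 5 s. */
static void txq_maybe_reset(struct txq_model *q, long now_ms)
{
	if (now_ms - q->reset_ts_ms > 5000) {
		q->min_space = TXQ_START_SPACE;
		q->reset_ts_ms = now_ms;
	}
}
```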
>> The implementation is really naive and way too simple:
>> * reading jiffies for every Tx / Tx status is not a
>> good idea.
>> * jiffies are not fine-grained enough on all platforms
>> * the constants chosen are really arbitrary and can't be
>> tuned.
>> * This should probably be implemented in mac80211 so that it
>> can help other drivers.
>> * etc...
>>
>> But already this gives nice results. I ran a very simple
>> experiment: I put the device in a controlled environment
>> and ran traffic while running default sized ping in the
>> background. In this configuration, our device quickly
>> raises its transmission rate to the best rate.
>> Then, I force the device to use the lowest rate (6Mbps).
>> Of course, the throughput collapses, but the ping RTT
>> shoots up.
>> Using codel helps, but the latency is still high. Codel
>> with this patch gives much better results:
>>
>> pfifo_fast:
>> rtt min/avg/max/mdev = 1932.616/2393.284/2833.407/315.941 ms, pipe 3, ipg/ewma 2215.707/2446.884 ms
>>
>> fq_codel + Tx queue auto-sizing:
>> rtt min/avg/max/mdev = 13.541/32.396/54.791/9.610 ms, ipg/ewma 200.685/32.202 ms
>>
>> fq_codel without Tx queue auto-sizing:
>> rtt min/avg/max/mdev = 140.821/257.303/331.889/31.074 ms, pipe 2, ipg/ewma 258.147/252.847 ms
> This is a dramatic improvement. But I'm not sure what you are
> measuring. Is this the 6mbit test? What happens when you send traffic
> the other way (more pure acks, rather than big packets?)
Not tested. I only tested the part that I thought was most interesting:
lots of TCP traffic (Chariot) with a very low rate and ping in the
background.
>
> I try to encourage folk to use flent whenever possible, for pretty
> graphs and long term measurements, so you can simultaneously measure
> both throughput and latency.
>
> flent.org's .14 release just shipped.
Ok - I hope I'll get some time to give it a try.
>> Clearly, there is more work to do to be able to merge this,
>> but it seems that the wireless problems mentioned in
>> https://lwn.net/Articles/616241/ may have a solution.
> I gave talks on the problems that wifi had with bufferbloat at the
> IEEE 802.11 WG meeting a while back, and more recently one was filmed
> at battlemesh.
>
> https://www.youtube.com/watch?v=Rb-UnHDw02o
>
> I have spent my time since trying to raise sufficient resources
> (testbeds and test tools), orgs, people and money to tackle these
> problems at more depth. We made a bit of progress recently which I can
> talk about offline...
>
> In that talk I suggested that overall we move towards timestamping
> everything, that (at least in the case of the ath9k and mt72) we tie
> together aggregation with a byte based estimator similar to how BQL
> works, and I hoped that eventually, at low rates, we'd be able to
> keep no more than one aggregate in the hardware, one in the driver
> queue, and one being assembled. The pending aggregate would be
> sent to the hardware on the completion interrupt for the previous
> aggregate, which would fire off the size estimator and start
> aggregating the one being assembled.
>
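[Editor's sketch: the byte-based estimator Dave describes can be illustrated with a toy BQL-style limit. This is loosely modeled on the dynamic queue limits (dql) idea; the real kernel implementation lives in lib/dql.c, and every name below (byte_limit, bl_try_queue, bl_complete) is invented for this sketch.]

```c
/* Toy BQL-style byte limit: admit aggregates while under the limit,
 * and adapt the limit on each completion interrupt. */

struct byte_limit {
	unsigned int limit;     /* bytes we allow in flight */
	unsigned int in_flight; /* bytes queued but not yet completed */
};

/* Queue an aggregate of 'bytes' only if it fits under the limit;
 * otherwise return 0 so the packets stay in the qdisc. */
static int bl_try_queue(struct byte_limit *bl, unsigned int bytes)
{
	if (bl->in_flight + bytes > bl->limit)
		return 0;
	bl->in_flight += bytes;
	return 1;
}

/* On the completion interrupt: release the bytes and adapt.
 * If the hardware went idle before we refilled it, the limit was
 * too small; otherwise trim it toward the minimal standing queue. */
static void bl_complete(struct byte_limit *bl, unsigned int bytes,
			int hw_ran_dry)
{
	bl->in_flight -= bytes;
	if (hw_ran_dry)
		bl->limit += bytes / 2;  /* grow to keep the NIC busy */
	else if (bl->limit > bytes)
		bl->limit -= bytes / 4;  /* shrink toward minimal queue */
}
```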
> A hook to do that is in use on the mt72 chipset that felix is working
> on... but nowhere else so far as I know (as yet).
>
> the iwl does its own aggregation (I think(?))... but estimates can
> still be made...
>
> There are WAY more details of course (per-station queuing, a separate
> multicast queue, only some covered in that talk!), but my hope was
> that under good conditions we'd get wireless-n driver overhead down
> below 12ms, even at 6Mbit, before something like fq_codel could kick
> in (under good conditions! Plenty of other potential latency sources
> besides excessive queuing in wifi!). My ideal world would be to hold
> it at under 1250us at higher rates....
>
> Periodically sampling seems like a reasonable approach under lab
> conditions but it would be nicer to have feedback from the firmware -
> "I transmitted the last tx as an X byte aggregate, at MCS1, I had to
> retransmit a few packets once, it took me 6ms to acquire the media, I
> heard 3 other stations transmitting, etc.".
>
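[Editor's sketch: the firmware feedback Dave wishes for could look something like the record below. This is purely hypothetical; no such iwlwifi API exists, and both the struct and the helper are invented here to show how such a report could bound the airtime of the next aggregate.]

```c
/* Hypothetical per-aggregate TX feedback of the kind described
 * above: size, rate, retries, airtime, and contention heard. */
struct agg_tx_feedback {
	unsigned int agg_bytes;      /* size of the A-MPDU as sent */
	unsigned int mcs;            /* rate index actually used */
	unsigned int retries;        /* MPDU retransmissions needed */
	unsigned int airtime_us;     /* time to acquire + use the medium */
	unsigned int other_stations; /* stations heard while contending */
};

/* Scale the next aggregate so its airtime stays under a latency
 * target, assuming airtime grows linearly with aggregate size. */
static unsigned int agg_next_budget(const struct agg_tx_feedback *fb,
				    unsigned int target_latency_us)
{
	if (fb->airtime_us == 0)
		return fb->agg_bytes;
	return (unsigned int)((unsigned long long)fb->agg_bytes *
			      target_latency_us / fb->airtime_us);
}
```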
> The above info we know we can get from a few chipsets, but not enough
> was known about the iwl last I looked. And one reason why fq_codel -
> unassisted - is not quite the right thing on top of this is that
> multicast can take a really long time...
>
> Regardless, I'd highly love to see/use this patch myself in a variety
> of real world conditions and see what happens. And incremental
> progress is the only way forward. Thx for cheering me up.
>
>> Cc: Stephen Hemminger <stephen@...workplumber.org>
>> Cc: Dave Taht <dave.taht@...il.com>
>> Cc: Jonathan Corbet <corbet@....net>
>> Signed-off-by: Emmanuel Grumbach <emmanuel.grumbach@...el.com>
>> ---
>> Fix Dave's email address
>> ---
>> drivers/net/wireless/intel/iwlwifi/pcie/internal.h | 6 ++++
>> drivers/net/wireless/intel/iwlwifi/pcie/tx.c | 32 ++++++++++++++++++++--
>> 2 files changed, 35 insertions(+), 3 deletions(-)
>>
>> diff --git a/drivers/net/wireless/intel/iwlwifi/pcie/internal.h b/drivers/net/wireless/intel/iwlwifi/pcie/internal.h
>> index 2f95916..d83eb56 100644
>> --- a/drivers/net/wireless/intel/iwlwifi/pcie/internal.h
>> +++ b/drivers/net/wireless/intel/iwlwifi/pcie/internal.h
>> @@ -192,6 +192,11 @@ struct iwl_cmd_meta {
>> u32 flags;
>> };
>>
>> +struct iwl_txq_auto_size {
>> + int min_space;
>> + unsigned long reset_ts;
>> +};
>> +
>> /*
>> * Generic queue structure
>> *
>> @@ -293,6 +298,7 @@ struct iwl_txq {
>> bool block;
>> unsigned long wd_timeout;
>> struct sk_buff_head overflow_q;
>> + struct iwl_txq_auto_size auto_sz;
>> };
>>
>> static inline dma_addr_t
>> diff --git a/drivers/net/wireless/intel/iwlwifi/pcie/tx.c b/drivers/net/wireless/intel/iwlwifi/pcie/tx.c
>> index 837a7d5..4d1dee6 100644
>> --- a/drivers/net/wireless/intel/iwlwifi/pcie/tx.c
>> +++ b/drivers/net/wireless/intel/iwlwifi/pcie/tx.c
>> @@ -572,6 +572,8 @@ static int iwl_pcie_txq_init(struct iwl_trans *trans, struct iwl_txq *txq,
>>
>> spin_lock_init(&txq->lock);
>> __skb_queue_head_init(&txq->overflow_q);
>> + txq->auto_sz.min_space = 240;
>> + txq->auto_sz.reset_ts = jiffies;
>>
>> /*
>> * Tell nic where to find circular buffer of Tx Frame Descriptors for
>> @@ -1043,10 +1045,14 @@ void iwl_trans_pcie_reclaim(struct iwl_trans *trans, int txq_id, int ssn,
>> q->read_ptr != tfd_num;
>> q->read_ptr = iwl_queue_inc_wrap(q->read_ptr)) {
>> struct sk_buff *skb = txq->entries[txq->q.read_ptr].skb;
>> + struct ieee80211_tx_info *info;
>> + unsigned long tx_time;
>>
>> if (WARN_ON_ONCE(!skb))
>> continue;
>>
>> + info = IEEE80211_SKB_CB(skb);
>> +
>> iwl_pcie_free_tso_page(skb);
>>
>> __skb_queue_tail(skbs, skb);
>> @@ -1056,6 +1062,18 @@ void iwl_trans_pcie_reclaim(struct iwl_trans *trans, int txq_id, int ssn,
>> iwl_pcie_txq_inval_byte_cnt_tbl(trans, txq);
>>
>> iwl_pcie_txq_free_tfd(trans, txq);
>> +
>> + tx_time = (uintptr_t)info->driver_data[IWL_TRANS_FIRST_DRIVER_DATA + 2];
>> + if (time_before(jiffies, tx_time + msecs_to_jiffies(10))) {
>> + txq->auto_sz.min_space -= 10;
>> + txq->auto_sz.min_space =
>> + max(txq->auto_sz.min_space, txq->q.high_mark);
>> + } else if (time_after(jiffies,
>> + tx_time + msecs_to_jiffies(20))) {
>> + txq->auto_sz.min_space += 10;
>> + txq->auto_sz.min_space =
>> + min(txq->auto_sz.min_space, 252);
>> + }
>> }
>>
>> iwl_pcie_txq_progress(txq);
>> @@ -2185,6 +2203,7 @@ int iwl_trans_pcie_tx(struct iwl_trans *trans, struct sk_buff *skb,
>> struct iwl_device_cmd *dev_cmd, int txq_id)
>> {
>> struct iwl_trans_pcie *trans_pcie = IWL_TRANS_GET_PCIE_TRANS(trans);
>> + struct ieee80211_tx_info *info = IEEE80211_SKB_CB(skb);
>> struct ieee80211_hdr *hdr;
>> struct iwl_tx_cmd *tx_cmd = (struct iwl_tx_cmd *)dev_cmd->payload;
>> struct iwl_cmd_meta *out_meta;
>> @@ -2234,13 +2253,20 @@ int iwl_trans_pcie_tx(struct iwl_trans *trans, struct sk_buff *skb,
>>
>> spin_lock(&txq->lock);
>>
>> - if (iwl_queue_space(q) < q->high_mark) {
>> + info->driver_data[IWL_TRANS_FIRST_DRIVER_DATA + 2] =
>> + (void *)(uintptr_t)jiffies;
>> +
>> + if (time_after(jiffies,
>> + txq->auto_sz.reset_ts + msecs_to_jiffies(5000))) {
>> + txq->auto_sz.min_space = 240;
>> + txq->auto_sz.reset_ts = jiffies;
>> + }
>> +
>> + if (iwl_queue_space(q) < txq->auto_sz.min_space) {
>> iwl_stop_queue(trans, txq);
>>
>> /* don't put the packet on the ring, if there is no room */
>> if (unlikely(iwl_queue_space(q) < 3)) {
>> - struct ieee80211_tx_info *info = IEEE80211_SKB_CB(skb);
>> -
>> info->driver_data[IWL_TRANS_FIRST_DRIVER_DATA + 1] =
>> dev_cmd;
>> __skb_queue_tail(&txq->overflow_q, skb);
>> --
>> 2.5.0
> --
> Dave Täht
> Let's go make home routers and wifi faster! With better software!
> https://www.gofundme.com/savewifi
>