Date:	Mon, 26 Mar 2012 09:32:37 +0100
From:	David Woodhouse <dwmw2@...radead.org>
To:	David Miller <davem@...emloft.net>
Cc:	netdev@...r.kernel.org
Subject: Re: [STRAW MAN PATCH] sch_teql doesn't load-balance ppp(oatm) slaves

On Sun, 2012-03-25 at 17:36 -0400, David Miller wrote:
> From: David Woodhouse <dwmw2@...radead.org>
> Date: Sun, 25 Mar 2012 11:43:50 +0100
> 
> > It's a bad idea to have huge hidden queues (a whole wmem_default worth
> > of packets are in a hidden queue between ppp_generic and the ATM device,
> > ffs!) anyway, so perhaps if we just fix *that* within PPP, it should
> > work a bit better with TEQL?
> 
> Yes, the ATM device's deep transmit queue is quite undesirable.

Indeed, although I don't think it's the only cause of the problem I saw.

The first thing I tried was a hack in pppoatm_assign_vcc() to set the
socket's sk_sndbuf to 4KiB. It *seemed* to work, but only while all my
debugging printks in sch_teql were being spewed at 115200 baud over the
serial port. As soon as I hit SysRq-0 and the serial port delays went
away, I was back to bursts on one line then the other.

> But I actually don't see how the problem arises yet, I need more
> details.
>
> PPP itself will always stop the queue, and return NETDEV_TX_OK on a
> transmit attempt.  It may wake the queue back up before returning if
> the downstream device (such as pppoatm) accepted the packet.

It does indeed stop the queue. I think it then wakes it right back up
again in ppp_xmit_process(), *before* returning NETDEV_TX_OK. So the
offending calls to skb_dequeue() which are putting it back to the front
of the list are going to be from the softirq trying to feed the device.

I'll confirm that, then try fixing the PPP code so it doesn't stop and
immediately restart the queue. If it only stops the queue
*conditionally*, that may well fix the problem.

> But in either case NETDEV_TX_OK is returned and this is what the teql
> master transmit sees, and this takes the code path which advances the
> slave pointer to the next device.
> 
> Therefore the next teql master transmit should try the next device in
> the slave list, not the PPP device used in the previous call.

I instrumented everywhere that the 'next device' pointer (m->slaves) is
assigned in sch_teql. One of the printks you see below is in
teql_master_xmit(), and it's doing exactly what you say. And then
immediately afterwards you see the other printk in teql_dequeue(),
setting m->slaves right back to the original device again:

Mar 22 15:36:07 net1-173.woodhou.se kernel: [12612.673308] teql xmit cebca100 next cebca400
Mar 22 15:36:07 net1-173.woodhou.se kernel: [12612.677630] m->slaves becomes cebca100
Mar 22 15:36:07 net1-173.woodhou.se kernel: [12613.069589] teql xmit cebca100 next cebca400
Mar 22 15:36:07 net1-173.woodhou.se kernel: [12613.073884] m->slaves becomes cebca100
Mar 22 15:36:07 net1-173.woodhou.se kernel: [12613.113584] teql xmit cebca100 next cebca400
Mar 22 15:36:07 net1-173.woodhou.se kernel: [12613.117908] m->slaves becomes cebca100
Mar 22 15:36:08 net1-173.woodhou.se kernel: [12614.041113] teql xmit cebca100 next cebca400
Mar 22 15:36:08 net1-173.woodhou.se kernel: [12614.045411] m->slaves becomes cebca100
Mar 22 15:36:09 net1-173.woodhou.se kernel: [12614.258464] teql xmit cebca100 next cebca400
Mar 22 15:36:09 net1-173.woodhou.se kernel: [12614.262762] m->slaves becomes cebca100
Mar 22 15:36:09 net1-173.woodhou.se kernel: [12614.896259] teql xmit cebca100 next cebca400
Mar 22 15:36:09 net1-173.woodhou.se kernel: [12614.900559] m->slaves becomes cebca100
Mar 22 15:36:10 net1-173.woodhou.se kernel: [12616.129265] teql xmit cebca100 next cebca400
Mar 22 15:36:10 net1-173.woodhou.se kernel: [12616.133599] teql xmit cebca400 next cebca100
Mar 22 15:36:10 net1-173.woodhou.se kernel: [12616.137919] m->slaves becomes cebca100
Mar 22 15:36:10 net1-173.woodhou.se kernel: [12616.141673] m->slaves becomes cebca400
Mar 22 15:36:10 net1-173.woodhou.se kernel: [12616.148321] teql xmit cebca400 next cebca100
Mar 22 15:36:10 net1-173.woodhou.se kernel: [12616.152623] m->slaves becomes cebca400
Mar 22 15:36:10 net1-173.woodhou.se kernel: [12616.157979] teql xmit cebca400 next cebca100
Mar 22 15:36:10 net1-173.woodhou.se kernel: [12616.162276] m->slaves becomes cebca400
Mar 22 15:36:11 net1-173.woodhou.se kernel: [12616.172402] teql xmit cebca400 next cebca100
Mar 22 15:36:11 net1-173.woodhou.se kernel: [12616.176731] m->slaves becomes cebca400
Mar 22 15:36:13 net1-173.woodhou.se kernel: [12618.693948] teql xmit cebca400 next cebca100
Mar 22 15:36:13 net1-173.woodhou.se kernel: [12618.698275] m->slaves becomes cebca400
Mar 22 15:36:14 net1-173.woodhou.se kernel: [12619.263215] teql xmit cebca400 next cebca100
Mar 22 15:36:14 net1-173.woodhou.se kernel: [12619.267539] m->slaves becomes cebca400
Mar 22 15:36:14 net1-173.woodhou.se kernel: [12619.311534] teql xmit cebca400 next cebca100
Mar 22 15:36:14 net1-173.woodhou.se kernel: [12619.315828] m->slaves becomes cebca400
Mar 22 15:36:14 net1-173.woodhou.se kernel: [12619.645580] teql xmit cebca400 next cebca100
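
An excerpt like the one above can be tallied mechanically. The following is a minimal sketch, not part of the original instrumentation: it assumes the two printk formats shown in the log ("teql xmit X next Y" and "m->slaves becomes X"), and the names `tally`, `XMIT_RE`, `RESET_RE` and `SAMPLE` are made up for illustration. It counts transmits per slave and how often m->slaves is set back to the device that was just used.

```python
import re
from collections import Counter

# Patterns match the two debugging printks seen in the log excerpt.
XMIT_RE = re.compile(r"teql xmit (\w+) next (\w+)")
RESET_RE = re.compile(r"m->slaves becomes (\w+)")


def tally(lines):
    """Return (transmits per slave, number of pull-backs).

    A "pull-back" here means an 'm->slaves becomes X' line where X is the
    device used by the most recent xmit line, i.e. the pointer that
    teql_master_xmit() had just advanced is set back again.
    """
    xmits = Counter()
    pulled_back = 0
    last = None  # (sent_on, advanced_to) from the most recent xmit line
    for line in lines:
        m = XMIT_RE.search(line)
        if m:
            last = m.groups()
            xmits[last[0]] += 1
            continue
        m = RESET_RE.search(line)
        if m and last and m.group(1) == last[0] and last[0] != last[1]:
            pulled_back += 1
    return xmits, pulled_back


# A few lines copied from the excerpt (syslog prefix dropped), covering
# the moment where the traffic flips from cebca100 to cebca400.
SAMPLE = """\
[12614.896259] teql xmit cebca100 next cebca400
[12614.900559] m->slaves becomes cebca100
[12616.129265] teql xmit cebca100 next cebca400
[12616.133599] teql xmit cebca400 next cebca100
[12616.137919] m->slaves becomes cebca100
[12616.141673] m->slaves becomes cebca400
[12616.148321] teql xmit cebca400 next cebca100
[12616.152623] m->slaves becomes cebca400
""".splitlines()

if __name__ == "__main__":
    counts, resets = tally(SAMPLE)
    for dev, n in sorted(counts.items()):
        print(dev, n)
    print("pull-backs:", resets)
```

Feeding it the full log instead of `SAMPLE` gives per-device counts for the whole capture; the regexes use `search`, so the syslog prefix does not need to be stripped first.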

For the first few seconds it doesn't manage to send *any* packets out
the cebca400 queue. That queue gets marked as 'next', but never quite
makes it. And then it manages to flip, and for another few seconds it
sends *all* its packets out that queue, leaving the cebca100 queue idle.
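
The interaction described above can be pictured with a toy model. This is a sketch under stated assumptions, not the sch_teql code: `Master`, `Slave`, `xmit`, `slave_dequeue` and `run` are hypothetical names, and the condition under which the pointer is pulled back (the slave asks for more work and finds its own queue empty) is an assumption made for the model. It only shows why a dequeue that immediately follows every transmit keeps all traffic on one slave; it does not model the timing that eventually makes the traffic flip to the other one.

```python
from collections import deque


class Slave:
    def __init__(self, name):
        self.name = name
        self.q = deque()  # stands in for the slave qdisc's packet queue
        self.sent = 0


class Master:
    def __init__(self, slaves):
        self.ring = slaves
        self.cur = 0  # index playing the role of m->slaves

    def xmit(self, pkt):
        """Send on the current slave, then advance to the next one."""
        s = self.ring[self.cur]
        s.sent += 1
        self.cur = (self.cur + 1) % len(self.ring)
        return s

    def slave_dequeue(self, s):
        """Slave asks for more work (ASSUMED behaviour for this model):
        if its queue is empty, point the master back at this slave."""
        if s.q:
            return s.q.popleft()
        self.cur = self.ring.index(s)
        return None


def run(n, dequeue_after_xmit):
    """Send n packets; optionally let the used slave dequeue right after
    each transmit, as if its queue had been woken immediately."""
    a, b = Slave("A"), Slave("B")
    m = Master([a, b])
    for i in range(n):
        s = m.xmit(i)
        if dequeue_after_xmit:
            m.slave_dequeue(s)
    return a.sent, b.sent


if __name__ == "__main__":
    print(run(10, dequeue_after_xmit=True))
    print(run(10, dequeue_after_xmit=False))
```

With the immediate dequeue, every packet goes out the first slave; without it, the two slaves alternate. That matches the intuition behind stopping the PPP queue only conditionally, so that the slave does not come straight back asking for more.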

-- 
dwmw2
