linux-kernel - Re: bad networking related lag in v2.6.22-rc2


Message-ID: <20070523063052.GB26814@elte.hu>
Date:	Wed, 23 May 2007 08:30:52 +0200
From:	Ingo Molnar <mingo@...e.hu>
To:	Anant Nitya <kernel@...chanda.info>
Cc:	linux-kernel@...r.kernel.org, Patrick McHardy <kaber@...sh.net>,
	Linus Torvalds <torvalds@...ux-foundation.org>,
	Andrew Morton <akpm@...ux-foundation.org>,
	Thomas Gleixner <tglx@...utronix.de>,
	"David S. Miller" <davem@...emloft.net>
Subject: Re: bad networking related lag in v2.6.22-rc2


* Anant Nitya <kernel@...chanda.info> wrote:

> > could you also apply the fix for the softirq problem below, to make 
> > sure it does not interact?

> Above patch does solve __ soft_irq_pending __ problem. I am running 
> this patch with kernel 2.6.21.1 since last day doing all kinda things 
> but haven't encountered any __ NOHZ: local_softirq_pending __. But 
> network lag that I am seeing since 2.6.22-rc1 is still there even with 
> this patch applied. If you need any more information please do ask. 
> Meanwhile I will do a git bisect as suggested by Linus to find out the 
> specific commit that introduced this problem and will inform you once I 
> find it. It's good to see the system running without any __ 
> local_softirq_problem __ :)

thanks.

if you feel inclined to try the git-bisection then by all means please 
do it (it will certainly be helpful and educative), but it's optional: i 
don't think you should 'need' to go through extra debugging chores, my 
analysis based on the excellent trace you provided still holds and 
whoever modified htb_dequeue()'s logic recently ought to be able to 
figure that out (or send you a debug patch to further narrow the problem 
down).

The trace shows a _clearly_ anomalous loop: for example there's 56396 
(!) calls to rb_first() in htb_dequeue() [without the kernel ever 
exiting that function]:

  earth4:~/s> grep rb_first trace-to-ingo.txt  | wc -l
  56396

and the set of rules you are using is quite simple and the networking 
load you are running is not large by any means. Here's the trace analysis 
again, below.

	Ingo

----------------------->

> http://cybertek.info/taitai/trace-to-ingo.txt.bz2

This trace indeed includes the smoking gun, htb_dequeue() and 
__qdisc_run():

   privoxy-12926 1.Ns1 1597us : rb_first (htb_dequeue)

this goes on, non-preemptible, for 160 milliseconds (!):

 privoxy-12926 1.Ns1 161568us : rb_first (htb_dequeue)
 privoxy-12926 1.Ns1 161568us : qdisc_watchdog_schedule (htb_dequeue)

and finally manages to escape the loop:

 privoxy-12926 1.Ns1 161597us : rb_first (htb_dequeue)
 privoxy-12926 1.Ns1 161597us : rb_first (htb_dequeue)
 privoxy-12926 1.Ns1 161599us : htb_safe_rb_erase (htb_dequeue)
 privoxy-12926 1.Ns1 161599us : rb_erase (htb_safe_rb_erase)
 privoxy-12926 1.Ns1 161600us : htb_change_class_mode (htb_dequeue)
 privoxy-12926 1.Ns1 161601us : htb_activate_prios (htb_change_class_mode)

and the system recovers.
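
For anyone who wants to repeat this kind of check on their own trace, here is a minimal sketch in Python. It assumes the trace lines have the same shape as the excerpts above (task-pid, CPU/flags field, a timestamp in microseconds, then "function (caller)"). The helper name summarize_calls and the regular expression are made up for illustration and are not part of any kernel tool. It counts the calls to one function from one caller and reports the first and last timestamp, which is roughly what the grep plus eyeballing the timestamps did above.

```python
import re
from typing import Iterable, Optional, Tuple

# Assumed line shape, taken from the excerpts in this mail:
#   " privoxy-12926 1.Ns1 161568us : rb_first (htb_dequeue)"
TRACE_LINE = re.compile(
    r"^\s*(?P<task>\S+)\s+(?P<flags>\S+)\s+(?P<us>\d+)us\s*:\s*"
    r"(?P<func>\w+)\s+\((?P<caller>\w+)\)"
)


def summarize_calls(
    lines: Iterable[str], func: str, caller: str
) -> Tuple[int, Optional[int], Optional[int]]:
    """Return (call count, first timestamp, last timestamp) in microseconds
    for calls to `func` made from `caller`. Hypothetical helper."""
    count = 0
    first: Optional[int] = None
    last: Optional[int] = None
    for line in lines:
        m = TRACE_LINE.match(line)
        if m is None:
            continue  # skip headers and anything that is not a trace entry
        if m.group("func") != func or m.group("caller") != caller:
            continue
        ts = int(m.group("us"))
        if first is None:
            first = ts
        last = ts
        count += 1
    return count, first, last


if __name__ == "__main__":
    # Only the handful of lines quoted in this mail, not the full trace.
    sample = [
        "   privoxy-12926 1.Ns1 1597us : rb_first (htb_dequeue)",
        " privoxy-12926 1.Ns1 161568us : rb_first (htb_dequeue)",
        " privoxy-12926 1.Ns1 161568us : qdisc_watchdog_schedule (htb_dequeue)",
        " privoxy-12926 1.Ns1 161597us : rb_first (htb_dequeue)",
        " privoxy-12926 1.Ns1 161597us : rb_first (htb_dequeue)",
        " privoxy-12926 1.Ns1 161599us : htb_safe_rb_erase (htb_dequeue)",
    ]
    count, first, last = summarize_calls(sample, "rb_first", "htb_dequeue")
    if first is not None and last is not None:
        print(f"{count} calls over {(last - first) / 1000.0:.1f} ms")
```

To run it over a full trace, pass an open file object as `lines` instead of the sample list. On the quoted lines alone the span between the first and last rb_first entry matches the roughly 160 milliseconds mentioned above.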
