netdev - Re: [Bugme-new] [Bug 12201] New: long wait in call_usermodehelper() / queue_work() / wait_for


Message-Id: <20081211143758.510b51b6.akpm@linux-foundation.org>
Date:	Thu, 11 Dec 2008 14:37:58 -0800
From:	Andrew Morton <akpm@...ux-foundation.org>
To:	mike@...ticaltech.com
Cc:	bugme-daemon@...zilla.kernel.org, netdev@...r.kernel.org,
	linux-kernel@...r.kernel.org, Al Viro <viro@...iv.linux.org.uk>
Subject: Re: [Bugme-new] [Bug 12201] New: long wait in call_usermodehelper()
 / queue_work() / wait_for_completion()


(switched to email.  Please respond via emailed reply-to-all, not via the
bugzilla web interface).

On Thu, 11 Dec 2008 14:15:21 -0800 (PST)
bugme-daemon@...zilla.kernel.org wrote:

> http://bugzilla.kernel.org/show_bug.cgi?id=12201
> 
>            Summary: long wait in call_usermodehelper() / queue_work() /
>                     wait_for_completion()
>            Product: Process Management
>            Version: 2.5
>      KernelVersion: 2.6.26.8
>           Platform: All
>         OS/Version: Linux
>               Tree: Mainline
>             Status: NEW
>           Severity: normal
>           Priority: P1
>          Component: Other
>         AssignedTo: process_other@...nel-bugs.osdl.org
>         ReportedBy: mike@...ticaltech.com
> 
> 
> Latest working kernel version: None
> Earliest failing kernel version: 2.6.26 (can't test any older)

It'd be great if you could test something more recent please.

> Distribution: CentOS 5
> Hardware Environment: Sun x4450, 16-cores, 128GB of RAM
> Software Environment: CentOS 5 + Apache webserver
> Problem Description:
> My problem started with SSH using the audit library, and my kernel not having
> AUDIT support.  During strace, any call to socket(PF_NETLINK, SOCK_RAW,
> NETLINK_AUDIT) took 1-2 seconds to return.  During this time, sys% was high.

Well this is bad.  We don't want the kernel calling out to userspace
each time you run socket(PF_NETLINK, ...).  The performance could be
awful.

I don't know if this is a net problem, an audit problem or whatever. 
Probably the offending kernel code simply shouldn't exist if
CONFIG_AUDIT=n.

Please attach a copy of the config to
http://bugzilla.kernel.org/show_bug.cgi?id=12201

> As I continued to dig deeper (using lots of printks), I found that these delays
> were caused by the netlink_create() code calling request_module() to find/load
> a module for AUDIT support which doesn't exist.
> 
> Continuing to dig, I found that request_module() uses call_usermodehelper() to
> run /sbin/modprobe to find/load the module.
> 
> The farthest I got is that after the process is created, we call
> wait_for_completion() to get the result of that process.  This waiting process
> takes 1-2 seconds.
> 
> The big problem in troubleshooting here is that this only starts to happen
> after the server has been online for a while (10 days maybe) and serving lots
> of traffic.  The delay gradually builds up and maxes out at around 2 seconds.
> 
> If I manually call /sbin/modprobe on the command line and provide it the same
> arguments that call_usermodehelper() uses, the command returns instantly 100%
> of the time (assuming server has been on for a while).
> 
> If I write a small pilot program that calls socket(PF_NETLINK, SOCK_RAW,
> NETLINK_AUDIT), it will delay by 1-2 seconds 100% of the time (assuming server
> has been online for a while).  Certain protocol types given to socket() have
> zero delay (because no module needs to be loaded).
> 
> Steps to reproduce:
> Once server has been online for a while, a simple call to socket(PF_NETLINK,
> SOCK_RAW, NETLINK_AUDIT) shows the problem.

OK, weird.

Please get sysrq working then get us a task trace, so we can see who is
sleeping where.  Do this:


- run your "small pilot program"

- wait one second (so we catch it while it is delaying)

- echo t > /proc/sysrq-trigger

- dmesg -s 1000000 > foo

- send us a copy of foo.

(foo will be large, so it would be best to attach it to
http://bugzilla.kernel.org/show_bug.cgi?id=12201 then email us the URL)

(Using `echo w > /proc/sysrq-trigger' would work too, if we know that
the offending processes are stuck in D state.  It will produce less
output)
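
For reference, the "small pilot program" from the report could look roughly
like the sketch below. The reporter's actual program was not posted, so the
helper name time_netlink_socket() and the choice of CLOCK_MONOTONIC for timing
are assumptions made for illustration. Only the
socket(PF_NETLINK, SOCK_RAW, NETLINK_AUDIT) call itself comes from the report.

```c
#define _POSIX_C_SOURCE 200809L

#include <errno.h>
#include <time.h>
#include <unistd.h>
#include <sys/socket.h>
#include <linux/netlink.h>   /* NETLINK_AUDIT */

/*
 * Hypothetical helper (not from the original report): time a single
 * socket(PF_NETLINK, SOCK_RAW, protocol) call.
 *
 * Returns the elapsed wall-clock time in seconds.  If err is non-NULL it
 * receives 0 on success, or the errno value left by socket() on failure.
 */
static double time_netlink_socket(int protocol, int *err)
{
    struct timespec start, end;
    int fd, saved_errno;

    clock_gettime(CLOCK_MONOTONIC, &start);
    fd = socket(PF_NETLINK, SOCK_RAW, protocol);
    saved_errno = (fd < 0) ? errno : 0;   /* capture before other calls */
    clock_gettime(CLOCK_MONOTONIC, &end);

    if (fd >= 0)
        close(fd);
    if (err)
        *err = saved_errno;

    return (double)(end.tv_sec - start.tv_sec) +
           (double)(end.tv_nsec - start.tv_nsec) / 1e9;
}
```

Calling time_netlink_socket(NETLINK_AUDIT, &err) from a main() and printing
the result should show the 1-2 second delay on an affected machine.  On a
kernel without audit support the call is expected to fail, so the elapsed
time is the interesting value rather than the descriptor.  Calling it in a
loop would keep the process in the delay long enough to catch with sysrq.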
