Date:	Mon, 16 Jun 2008 22:01:13 -0600
From:	"Gregory Haskins" <ghaskins@...ell.com>
To:	"David Miller" <davem@...emloft.net>,
	"Patrick Mullaney" <PMullaney@...ell.com>
Cc:	<herbert@...dor.apana.org.au>, <chuck.lever@...cle.com>,
	<netdev@...r.kernel.org>
Subject: Re: Killing sk->sk_callback_lock

>>> On Mon, Jun 16, 2008 at  9:53 PM, in message
<20080616.185328.85842051.davem@...emloft.net>, David Miller
<davem@...emloft.net> wrote: 
> From: "Patrick Mullaney" <pmullaney@...ell.com>
> Date: Mon, 16 Jun 2008 19:38:23 -0600
> 
>> The overhead I was trying to address was scheduler overhead.
> 
> Neither Herbert nor I are convinced of this yet, and you have
> to show us why you think this is the problem and not (in
> our opinion) the more likely sk_callback_lock overhead.

Please bear with us.  It is not our intent to be annoying, but we are perhaps doing a poor job of describing the actual nature of the issue we are seeing.

To be clear on how we got to this point: We are tasked with improving the performance of our particular kernel configuration.  We observed that this configuration had comparatively poor UDP performance, so we started investigating why.  We instrumented the kernel with various tools (lockdep, oprofile, logdev, etc.) and observed two wakeups for every packet received while running a multi-threaded netperf UDP throughput benchmark on some 8-core boxes.

A common pattern emerged in the instrumentation data: an arbitrary client thread would block on the wait-queue waiting for a UDP packet.  It would of course initially block because there were no packets available.  It would then wake up, check the queue, see that there were still no packets available, and go back to sleep.  A short time later it would wake up again, find a packet, and return to userspace.
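
To illustrate what the client thread is doing, here is a condensed sketch of the receive-side wait loop, loosely modeled on wait_for_packet() in net/core/datagram.c.  The function name is made up and the error/signal handling is omitted; it is only meant to show why a wakeup that arrives with an empty receive queue is pure overhead:

#include <linux/sched.h>
#include <linux/wait.h>
#include <net/sock.h>

/* Illustrative only -- not the real wait_for_packet() */
static int demo_wait_for_packet(struct sock *sk, long *timeo)
{
        DEFINE_WAIT(wait);

        /* sleep on the socket's single wait-queue */
        prepare_to_wait_exclusive(sk->sk_sleep, &wait, TASK_INTERRUPTIBLE);

        /*
         * Any wake_up() on sk->sk_sleep lands us here -- whether it was
         * issued because a packet arrived or because write space opened
         * up.  If the receive queue is still empty, we have burned a
         * context switch for nothing and simply go back to sleep.
         */
        if (skb_queue_empty(&sk->sk_receive_queue))
                *timeo = schedule_timeout(*timeo);

        finish_wait(sk->sk_sleep, &wait);
        return skb_queue_empty(&sk->sk_receive_queue) ? -EAGAIN : 0;
}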

This seemed odd to us, so we investigated further to see whether an improvement was lurking or whether this was expected behavior.  We traced each wakeup back to 1) the wmem/NOSPACE code and 2) the rx-wakeup code in the softirq.  First the softirq would process the tx-completions, which would wake_up() the wait-queue for NOSPACE signaling.  Since the client was waiting for a packet on the same wait-queue, this is where the first wakeup came from.  Later the softirq finally pushed an actual packet onto the queue, and the client was re-awoken via the same overloaded wait-queue.  This time it would successfully find a packet and return to userspace.
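
In other words, both events funnel through the same wait-queue via the socket's default callbacks.  Roughly (simplified from the sock_def_write_space()/sock_def_readable() pattern in net/core/sock.c; function names below are illustrative and the exact wake flavors and async signalling are omitted):

#include <net/sock.h>

/* Roughly what the tx-completion path does via sk->sk_write_space() */
static void demo_write_space(struct sock *sk)
{
        read_lock(&sk->sk_callback_lock);
        if (sock_writeable(sk) && sk->sk_sleep && waitqueue_active(sk->sk_sleep))
                wake_up_interruptible(sk->sk_sleep);    /* wakeup #1 */
        read_unlock(&sk->sk_callback_lock);
}

/* Roughly what the rx path does via sk->sk_data_ready() once a packet is queued */
static void demo_data_ready(struct sock *sk, int len)
{
        read_lock(&sk->sk_callback_lock);
        if (sk->sk_sleep && waitqueue_active(sk->sk_sleep))
                wake_up_interruptible(sk->sk_sleep);    /* wakeup #2, same queue */
        read_unlock(&sk->sk_callback_lock);
}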

Since the client does not care about wmem/NOSPACE in the UDP rx path, yet the two events share a single wait-queue, the first wakeup was completely wasted.  It just caused extra scheduling activity that did not help in any way (and is quite expensive in the grand scheme of things).  Based on this lead, Pat devised a solution that eliminates the extra wake_up() when there are no clients waiting for that particular NOSPACE event (a rough sketch of the idea follows the two numbers below).  With his patch applied, we observed two things:

1) We now had one wakeup per packet instead of two (decreasing context-switching rates by ~50%)
2) Overall UDP throughput increased by ~25%
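
To make the kind of gate I mean concrete, here is an illustration in the spirit of Pat's change (not his actual diff; the function name is made up): the write-space callback only bothers waking the queue if a would-be writer has advertised interest by setting SOCK_NOSPACE.

#include <net/sock.h>

/* Illustration only -- not Pat's patch */
static void demo_write_space_gated(struct sock *sk)
{
        struct socket *sock = sk->sk_socket;

        read_lock(&sk->sk_callback_lock);
        if (sock && test_bit(SOCK_NOSPACE, &sock->flags) && sock_writeable(sk)) {
                clear_bit(SOCK_NOSPACE, &sock->flags);
                if (sk->sk_sleep && waitqueue_active(sk->sk_sleep))
                        wake_up_interruptible(sk->sk_sleep);
        }
        read_unlock(&sk->sk_callback_lock);
        /* a UDP receiver never sets SOCK_NOSPACE, so it is never woken here */
}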

Both improvements held even without our instrumentation present, so I don't think we can chalk the "double-wake" analysis up to an anomaly caused by the instrumentation itself.  Based on that, it would at least appear that the odd behavior w.r.t. the phantom wakeup does indeed hinder performance.  This is not to say that the locking issues you highlight are not also an issue.  But note that we have no evidence to suggest this particular phenomenon is related in any way to the locking (in fact, sk_callback_lock was not showing up at all on the lockdep radar for this particular configuration, indicating a low contention rate).

So by all means, if there are improvements to the locking that can be made, that's great!  But fixing the locking will not likely address the scheduler overhead Pat refers to, IMO.  They would appear to be orthogonal issues.  I will keep an open mind, but the root cause seems to be either the stack's tendency to overload the wait-queue, or the fact that UDP sockets do not dynamically manage the NOSPACE state flags (sketched below).  From my perspective, I am not married to any one particular solution as long as the fundamental "phantom wakeup" problem is addressed.
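
By "dynamically manage" I mean something like the sock_wait_for_wmem()-style handshake, where a blocked sender advertises interest before sleeping so the write-space callback has something to test.  Again, this is just a sketch with a made-up name, not a proposed patch:

#include <linux/sched.h>
#include <linux/wait.h>
#include <net/sock.h>

/* Sketch of a blocking sender participating in the NOSPACE handshake */
static long demo_wait_for_sndbuf(struct sock *sk, long timeo)
{
        DEFINE_WAIT(wait);

        /* advertise that we actually care about write space */
        set_bit(SOCK_NOSPACE, &sk->sk_socket->flags);

        prepare_to_wait(sk->sk_sleep, &wait, TASK_INTERRUPTIBLE);
        if (!sock_writeable(sk))
                timeo = schedule_timeout(timeo);
        finish_wait(sk->sk_sleep, &wait);

        clear_bit(SOCK_NOSPACE, &sk->sk_socket->flags);
        return timeo;
}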

HTH

Regards,
-Greg


--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
