lists.openwall.net | lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC | |
Open Source and information security mailing list archives
| ||
|
Date: Thu, 30 Apr 2015 10:34:17 -0400 From: Jason Baron <jbaron@...mai.com> To: David Miller <davem@...emloft.net> CC: Jason Baron <jbaron@...mai.com>, netdev@...r.kernel.org, eric.dumazet@...il.com Subject: Re: [PATCH] tcp: set SOCK_NOSPACE under memory presure On 04/22/2015 10:25 AM, Jason Baron wrote: > On 04/21/2015 05:33 PM, David Miller wrote: >> From: Jason Baron <jbaron@...mai.com> >> Date: Mon, 20 Apr 2015 20:05:13 +0000 (GMT) >> >>> Under tcp memory pressure, calling epoll_wait() in edge triggered >>> mode after -EAGAIN, can result in an indefinite hang in epoll_wait(), >>> even when there is suffcient memory available to continue making >>> progress. The problem is that __sk_mem_schedule() can return 0, >>> under memory pressure without having set the SOCK_NOSPACE flag. Thus, >>> even though all the outstanding packets have been acked, we never >>> get the EPOLLOUT that we are expecting from epoll_wait(). >>> >>> This issue is currently limited to epoll when used in edge trigger >>> mode, since 'tcp_poll()', does in fact currently set SOCK_NOSPACE. >>> This is sufficient for poll()/select() and epoll() in level trigger >>> mode. However, in edge trigger mode, epoll() is relying on the write >>> path to set SOCK_NOSPACE. So I view this patch as bringing us into >>> sync with poll()/select() and epoll() level trigger behavior. >> Can you explain exactly how epoll in edge trigger mode is >> depending upon SOCK_NOSPACE being set in this way? I tried >> to read the epoll code and it just seems to call ->poll() >> in the normal way when returning event state. > In edge trigger mode, when we receive a wakeup event we > call ->poll() in the normal way, *but* we do not leave the event > as still pending. (Specifically, in the epoll() code we are not > re-adding it (fs/eventpoll.c:ep_send_events_proc())). This is > because we are only interested in the 'edge' or the event > going high. In level trigger mode, we do leave the event > pending if its 'high', such that it will re-trigger again for us on > the next epoll_wait(). > > EPOLL(7) is clear that in edge-trigger mode we can only do > epoll_wait() after read/write return -EAGAIN. Thus, in the > case of the socket write, we are relying on the fact that > tcp_sendmsg()/network layer is going to issue a wakeup > for us at some point in the future when we get -EAGAIN. > > This all works fine in the case you pointed out where we > have exceeded the sk->sndbuf and set SOCK_NOSPACE. > However, when we return -EAGAIN from the write path > b/c we are over the tcp memory limits and not b/c we are > over the sndbuf, we are never going to get another wakeup > (since SOCK_NOSPACE is not set in this case). Level trigger > avoids this since the subsequent epoll_wait() is going to > re-try the ->poll() (and set SOCK_NOSPACE if it fails). > > Now, in the memory failure case, we are not really > waiting for the buffer to empty, but rather for there to be > memory more generally available. So it could be argued > that we need to implement a wakeup here based on memory > being available as opposed to the write queue emptying. > That is one potential option here. > > I think the other one is the route I was proposing, which was > to treat the out of memory case, in the same way as the > sk->sndbuf queue full case, as select(), poll() and epoll() level > trigger are currently doing. And potentially add maybe an > -ENOSPC return if the write queue really is empty...I thought > that approach made sense b/c even under memory pressure > (over sk_prot_mem_limits(sk, 1), but not over > sk_prot_mem_limits(sk, 2)) we continue to guarantee a > minimum sndbuf size (implying we can keep making progress). > That said, there is a case, over sk_prot_mem_limits(sk, 2), > where we do not guarantee the minimum buffer size, but I > think in practice that is very hard to hit (since we are reducing > usage over sk_prot_mem_limits(sk, 1) aggressively). > > There is also the case where we actually are out of memory > on the system, ie kmalloc() etc. are failing, in which case > we could maybe return -ENOSPC, or else we would potentially > need a larger change to wait on memory being available > as opposed to the buffer emptying. > > Thanks, > > -Jason Hi, Just curious if anybody had any further reaction on this issue. I think making the epoll edge trigger case, as least match what we are seeing for poll()/select()/epoll() level trigger seems reasonable here. Thanks, -Jason -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@...r.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Powered by blists - more mailing lists