Message-ID: <CANn89i+=GyHjkrHMZAftB-toEhi9GcAQom1_bpT+S0qMvCz0DQ@mail.gmail.com>
Date: Thu, 7 Jul 2022 18:29:03 +0200
From: Eric Dumazet <edumazet@...gle.com>
To: Guillaume Nault <gnault@...hat.com>
Cc: "David S. Miller" <davem@...emloft.net>, Jakub Kicinski <kuba@...nel.org>,
	Paolo Abeni <pabeni@...hat.com>, netdev <netdev@...r.kernel.org>,
	Chuck Lever <chuck.lever@...cle.com>, Jeff Layton <jlayton@...nel.org>,
	Trond Myklebust <trond.myklebust@...merspace.com>,
	Anna Schumaker <anna@...nel.org>, Steve French <sfrench@...ba.org>,
	Josef Bacik <josef@...icpanda.com>, Scott Mayhew <smayhew@...hat.com>,
	Benjamin Coddington <bcodding@...hat.com>, Tejun Heo <tj@...nel.org>
Subject: Re: [RFC net] Should sk_page_frag() also look at the current GFP context?

On Fri, Jul 1, 2022 at 8:41 PM Guillaume Nault <gnault@...hat.com> wrote:
>
> I'm investigating a kernel oops that looks similar to
> 20eb4f29b602 ("net: fix sk_page_frag() recursion from memory reclaim")
> and dacb5d8875cc ("tcp: fix page frag corruption on page fault").
>
> This time the problem happens on an NFS client, while the previous bzs
> respectively used NBD and CIFS. While NBD and CIFS clear __GFP_FS in
> their socket's ->sk_allocation field (using GFP_NOIO or GFP_NOFS), NFS
> leaves sk_allocation to its default value since commit a1231fda7e94
> ("SUNRPC: Set memalloc_nofs_save() on all rpciod/xprtiod jobs").
>
> To recap the original problems, in commit 20eb4f29b602 and dacb5d8875cc,
> memory reclaim happened while executing tcp_sendmsg_locked(). The code
> path entered tcp_sendmsg_locked() recursively as pages to be reclaimed
> were backed by files on the network. The problem was that both the
> outer and the inner tcp_sendmsg_locked() calls used current->task_frag,
> thus leaving it in an inconsistent state. The fix was to use the
> socket's ->sk_frag instead for the file system socket, so that the
> inner and outer calls wouldn't step on each other's toes.
>
> But now that NFS doesn't modify ->sk_allocation anymore, sk_page_frag()
> sees sunrpc sockets as plain TCP ones and returns ->task_frag in the
> inner tcp_sendmsg_locked() call.
>
> Also it looks like the trend is to avoid GFP_NOFS and GFP_NOIO and use
> memalloc_no{fs,io}_save() instead. So maybe other network file systems
> will also stop setting ->sk_allocation in the future and we should
> teach sk_page_frag() to look at the current GFP flags. Or should we
> stick to ->sk_allocation and make NFS drop __GFP_FS again?
>
> Signed-off-by: Guillaume Nault <gnault@...hat.com>

Can you provide a Fixes: tag ?

> ---
>  include/net/sock.h | 8 ++++++--
>  1 file changed, 6 insertions(+), 2 deletions(-)
>
> diff --git a/include/net/sock.h b/include/net/sock.h
> index 72ca97ccb460..b934c9851058 100644
> --- a/include/net/sock.h
> +++ b/include/net/sock.h
> @@ -46,6 +46,7 @@
>  #include <linux/netdevice.h>
>  #include <linux/skbuff.h>      /* struct sk_buff */
>  #include <linux/mm.h>
> +#include <linux/sched/mm.h>
>  #include <linux/security.h>
>  #include <linux/slab.h>
>  #include <linux/uaccess.h>
> @@ -2503,14 +2504,17 @@ static inline void sk_stream_moderate_sndbuf(struct sock *sk)
>   * socket operations and end up recursing into sk_page_frag()
>   * while it's already in use: explicitly avoid task page_frag
>   * usage if the caller is potentially doing any of them.
>  - * This assumes that page fault handlers use the GFP_NOFS flags.
>  + * This assumes that page fault handlers use the GFP_NOFS flags
>  + * or run under memalloc_nofs_save() protection.
>   *
>   * Return: a per task page_frag if context allows that,
>   * otherwise a per socket one.
>   */
>  static inline struct page_frag *sk_page_frag(struct sock *sk)
>  {
> -	if ((sk->sk_allocation & (__GFP_DIRECT_RECLAIM | __GFP_MEMALLOC | __GFP_FS)) ==
> +	gfp_t gfp_mask = current_gfp_context(sk->sk_allocation);

This is slowing down TCP sendmsg() fast path, reading current->flags,
possibly cold value.

I would suggest using one bit in sk, close to sk->sk_allocation,
to make the decision, instead of testing sk->sk_allocation for various flags.

Not sure if we have available holes.

> +
> +	if ((gfp_mask & (__GFP_DIRECT_RECLAIM | __GFP_MEMALLOC | __GFP_FS)) ==
>  		(__GFP_DIRECT_RECLAIM | __GFP_FS))
>  		return &current->task_frag;
>
> --
> 2.21.3
>
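For context on the memalloc_no{fs,io}_save() trend Guillaume mentions above, here is
a minimal sketch of why ->sk_allocation alone no longer reflects the effective
allocation context once SUNRPC relies on scoped NOFS. The function name
xprt_send_example is made up for illustration; memalloc_nofs_save(),
memalloc_nofs_restore() and current_gfp_context() are the real <linux/sched/mm.h>
helpers the patch builds on.

#include <linux/printk.h>
#include <linux/sched/mm.h>
#include <net/sock.h>

/* Illustrative only: a scoped-NOFS section changes what current_gfp_context()
 * returns, even though sk->sk_allocation is left at its default GFP_KERNEL
 * (as on NFS/SUNRPC sockets since commit a1231fda7e94).
 */
static void xprt_send_example(struct sock *sk)
{
	unsigned int nofs_flags;
	gfp_t effective;

	nofs_flags = memalloc_nofs_save();	/* what rpciod/xprtiod workers do */

	/* sk->sk_allocation still reads GFP_KERNEL here, but ... */
	effective = current_gfp_context(sk->sk_allocation);

	/* ... the effective mask has __GFP_FS cleared, so the patched
	 * sk_page_frag() would return &sk->sk_frag instead of
	 * current->task_frag, avoiding the reclaim recursion.
	 */
	if (!(effective & __GFP_FS))
		pr_info("scoped NOFS active: sk_page_frag() falls back to sk->sk_frag\n");

	memalloc_nofs_restore(nofs_flags);
}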
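And a rough sketch of the "one bit in sk" alternative Eric suggests, which keeps
current->flags off the sendmsg() fast path. The field name sk_use_task_frag and its
placement are assumptions for illustration, not existing kernel code; socket owners
that can transmit from a reclaim/NOFS/NOIO context (NFS, NBD, CIFS, ...) would clear
the bit once at socket setup time instead of having every send re-derive GFP flags.

/* Assumed addition to struct sock, in a hole near sk_allocation:
 *
 *	u8	sk_use_task_frag : 1;	// defaults to 1
 */

static inline struct page_frag *sk_page_frag(struct sock *sk)
{
	/* One predictable bit test instead of current_gfp_context(),
	 * which has to read the possibly cold current->flags.
	 */
	if (sk->sk_use_task_frag)
		return &current->task_frag;

	return &sk->sk_frag;
}

Whether a suitable hole actually exists next to sk_allocation is the open question
at the end of the reply.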