Date: Mon, 11 Jul 2022 10:07:09 -0400
From: "Benjamin Coddington" <bcodding@...hat.com>
To: "Trond Myklebust" <trondmy@...merspace.com>, "Eric Dumazet" <edumazet@...gle.com>
Cc: edumazet@...gle.com, smayhew@...hat.com, davem@...emloft.net,
	chuck.lever@...cle.com, sfrench@...ba.org, tj@...nel.org,
	anna@...nel.org, kuba@...nel.org, jlayton@...nel.org,
	gnault@...hat.com, josef@...icpanda.com, netdev@...r.kernel.org,
	pabeni@...hat.com
Subject: Re: [RFC net] Should sk_page_frag() also look at the current GFP context?

On 8 Jul 2022, at 16:04, Trond Myklebust wrote:

> On Fri, 2022-07-08 at 14:10 -0400, Benjamin Coddington wrote:
>> On 7 Jul 2022, at 12:29, Eric Dumazet wrote:
>>
>>> On Fri, Jul 1, 2022 at 8:41 PM Guillaume Nault <gnault@...hat.com> wrote:
>>>>
>>>> I'm investigating a kernel oops that looks similar to
>>>> 20eb4f29b602 ("net: fix sk_page_frag() recursion from memory reclaim")
>>>> and dacb5d8875cc ("tcp: fix page frag corruption on page fault").
>>>>
>>>> This time the problem happens on an NFS client, while the previous bzs
>>>> respectively used NBD and CIFS. While NBD and CIFS clear __GFP_FS in
>>>> their socket's ->sk_allocation field (using GFP_NOIO or GFP_NOFS), NFS
>>>> leaves sk_allocation to its default value since commit a1231fda7e94
>>>> ("SUNRPC: Set memalloc_nofs_save() on all rpciod/xprtiod jobs").
>>>>
>>>> To recap the original problems, in commits 20eb4f29b602 and dacb5d8875cc,
>>>> memory reclaim happened while executing tcp_sendmsg_locked(). The code
>>>> path entered tcp_sendmsg_locked() recursively as pages to be reclaimed
>>>> were backed by files on the network. The problem was that both the
>>>> outer and the inner tcp_sendmsg_locked() calls used current->task_frag,
>>>> thus leaving it in an inconsistent state. The fix was to use the
>>>> socket's ->sk_frag instead for the file system socket, so that the
>>>> inner and outer calls wouldn't step on each other's toes.
>>>>
>>>> But now that NFS doesn't modify ->sk_allocation anymore, sk_page_frag()
>>>> sees sunrpc sockets as plain TCP ones and returns ->task_frag in the
>>>> inner tcp_sendmsg_locked() call.
>>>>
>>>> Also it looks like the trend is to avoid GFP_NOFS and GFP_NOIO and use
>>>> memalloc_no{fs,io}_save() instead. So maybe other network file systems
>>>> will also stop setting ->sk_allocation in the future and we should
>>>> teach sk_page_frag() to look at the current GFP flags. Or should we
>>>> stick to ->sk_allocation and make NFS drop __GFP_FS again?
>>>>
>>>> Signed-off-by: Guillaume Nault <gnault@...hat.com>
>>>
>>> Can you provide a Fixes: tag?
>>>
>>>> ---
>>>>  include/net/sock.h | 8 ++++++--
>>>>  1 file changed, 6 insertions(+), 2 deletions(-)
>>>>
>>>> diff --git a/include/net/sock.h b/include/net/sock.h
>>>> index 72ca97ccb460..b934c9851058 100644
>>>> --- a/include/net/sock.h
>>>> +++ b/include/net/sock.h
>>>> @@ -46,6 +46,7 @@
>>>>  #include <linux/netdevice.h>
>>>>  #include <linux/skbuff.h>	/* struct sk_buff */
>>>>  #include <linux/mm.h>
>>>> +#include <linux/sched/mm.h>
>>>>  #include <linux/security.h>
>>>>  #include <linux/slab.h>
>>>>  #include <linux/uaccess.h>
>>>> @@ -2503,14 +2504,17 @@ static inline void sk_stream_moderate_sndbuf(struct sock *sk)
>>>>   * socket operations and end up recursing into sk_page_frag()
>>>>   * while it's already in use: explicitly avoid task page_frag
>>>>   * usage if the caller is potentially doing any of them.
>>>> - * This assumes that page fault handlers use the GFP_NOFS flags.
>>>> + * This assumes that page fault handlers use the GFP_NOFS flags
>>>> + * or run under memalloc_nofs_save() protection.
>>>>   *
>>>>   * Return: a per task page_frag if context allows that,
>>>>   * otherwise a per socket one.
>>>>   */
>>>>  static inline struct page_frag *sk_page_frag(struct sock *sk)
>>>>  {
>>>> -	if ((sk->sk_allocation & (__GFP_DIRECT_RECLAIM | __GFP_MEMALLOC | __GFP_FS)) ==
>>>> +	gfp_t gfp_mask = current_gfp_context(sk->sk_allocation);
>>>
>>> This is slowing down TCP sendmsg() fast path, reading current->flags,
>>> possibly cold value.
>>
>> True - current->flags is pretty distant from current->task_frag.
>>
>>> I would suggest using one bit in sk, close to sk->sk_allocation to
>>> make the decision, instead of testing sk->sk_allocation for various
>>> flags.
>>>
>>> Not sure if we have available holes.
>>
>> It's looking pretty packed on my build.. the nearest hole is 5 cachelines
>> away.
>>
>> It'd be nice to allow network filesystems to use task_frag when possible.
>>
>> If we expect sk_page_frag() to only return task_frag once per call stack,
>> then can we simply check it's already in use, perhaps by looking at the
>> size field?
>>
>> Or maybe can we set sk_allocation early from current_gfp_context() outside
>> the fast path?
>
> Why not just add a bit to sk->sk_allocation itself, and have
> __sock_create() default to setting it when the 'kern' parameter is
> non-zero? NFS is not alone in following the request of the mm team to
> deprecate use of GFP_NOFS and GFP_NOIO.

Can we overload sk_allocation safely? There are 28 GFP flags already; I'm
worried about unintended consequences if sk_allocation gets passed on.

What about a flag in sk_gso_type? Looks like there are 13 free there, and
it's in the same cacheline as sk_allocation and sk_frag.

Ben
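For context on why the proposed patch covers NFS without touching ->sk_allocation:
current_gfp_context() masks __GFP_FS (and __GFP_IO) out of the passed-in mask whenever
the calling task is inside a memalloc_nofs_save()/memalloc_noio_save() scope. Below is a
simplified sketch of that masking, not the verbatim helper from include/linux/sched/mm.h
(the function is renamed here to make that clear, and the real one handles more PF_ bits):

	#include <linux/sched.h>
	#include <linux/gfp.h>

	/* Simplified sketch of what current_gfp_context() does: drop the
	 * reclaim flags that the caller's memalloc scope forbids.
	 */
	static inline gfp_t current_gfp_context_sketch(gfp_t flags)
	{
		unsigned int pflags = READ_ONCE(current->flags);

		if (pflags & PF_MEMALLOC_NOIO)		/* memalloc_noio_save() scope */
			flags &= ~(__GFP_IO | __GFP_FS);
		else if (pflags & PF_MEMALLOC_NOFS)	/* memalloc_nofs_save() scope */
			flags &= ~__GFP_FS;

		return flags;
	}

Since rpciod/xprtiod workers run inside memalloc_nofs_save() (commit a1231fda7e94), the
patched sk_page_frag() would see __GFP_FS cleared for sunrpc traffic and return
&sk->sk_frag, which is what the recursion fix depends on.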
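For illustration only, the "one bit in the socket, decided outside the fast path"
direction discussed above could look roughly like the sketch below. The flag name and its
placement are invented here to show the shape of the idea; none of this is existing
kernel API, and whether sk_gso_type can safely host such a bit is exactly the open
question in the thread:

	#include <linux/bits.h>
	#include <net/sock.h>

	/* Hypothetical: an invented bit, e.g. parked in the free upper bits of
	 * sk->sk_gso_type, set once at socket creation (__sock_create() with
	 * kern != 0, or by the network filesystem). The sendmsg fast path then
	 * tests one bit in the same cacheline as sk->sk_frag instead of
	 * reading current->flags on every call.
	 */
	#define SK_GSO_NO_TASK_FRAG	BIT(31)

	static inline struct page_frag *sk_page_frag_sketch(struct sock *sk)
	{
		if (sk->sk_gso_type & SK_GSO_NO_TASK_FRAG)
			return &sk->sk_frag;
		return &current->task_frag;
	}

The trade-off is the one raised above: a per-socket bit is decided once at setup, so it
cannot react to the caller's memalloc scope the way current_gfp_context() does.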