lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite for Android: free password hash cracker in your pocket
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Date:   Mon, 11 Jul 2022 10:07:09 -0400
From:   "Benjamin Coddington" <bcodding@...hat.com>
To:     "Trond Myklebust" <trondmy@...merspace.com>,
        "Eric Dumazet" <edumazet@...gle.com>
Cc:     edumazet@...gle.com, smayhew@...hat.com, davem@...emloft.net,
        chuck.lever@...cle.com, sfrench@...ba.org, tj@...nel.org,
        anna@...nel.org, kuba@...nel.org, jlayton@...nel.org,
        gnault@...hat.com, josef@...icpanda.com, netdev@...r.kernel.org,
        pabeni@...hat.com
Subject: Re: [RFC net] Should sk_page_frag() also look at the current GFP
 context?

On 8 Jul 2022, at 16:04, Trond Myklebust wrote:

> On Fri, 2022-07-08 at 14:10 -0400, Benjamin Coddington wrote:
>> On 7 Jul 2022, at 12:29, Eric Dumazet wrote:
>>
>>> On Fri, Jul 1, 2022 at 8:41 PM Guillaume Nault <gnault@...hat.com>
>>> wrote:
>>>>
>>>> I'm investigating a kernel oops that looks similar to
>>>> 20eb4f29b602 ("net: fix sk_page_frag() recursion from memory
>>>> reclaim")
>>>> and dacb5d8875cc ("tcp: fix page frag corruption on page fault").
>>>>
>>>> This time the problem happens on an NFS client, while the
>>>> previous
>>>> bzs
>>>> respectively used NBD and CIFS. While NBD and CIFS clear __GFP_FS
>>>> in
>>>> their socket's ->sk_allocation field (using GFP_NOIO or
>>>> GFP_NOFS),
>>>> NFS
>>>> leaves sk_allocation to its default value since commit
>>>> a1231fda7e94
>>>> ("SUNRPC: Set memalloc_nofs_save() on all rpciod/xprtiod jobs").
>>>>
>>>> To recap the original problems, in commit 20eb4f29b602 and
>>>> dacb5d8875cc,
>>>> memory reclaim happened while executing tcp_sendmsg_locked(). The
>>>> code
>>>> path entered tcp_sendmsg_locked() recursively as pages to be
>>>> reclaimed
>>>> were backed by files on the network. The problem was that both
>>>> the
>>>> outer and the inner tcp_sendmsg_locked() calls used
>>>> current->task_frag,
>>>> thus leaving it in an inconsistent state. The fix was to use the
>>>> socket's ->sk_frag instead for the file system socket, so that
>>>> the
>>>> inner and outer calls wouln't step on each other's toes.
>>>>
>>>> But now that NFS doesn't modify ->sk_allocation anymore,
>>>> sk_page_frag()
>>>> sees sunrpc sockets as plain TCP ones and returns ->task_frag in
>>>> the
>>>> inner tcp_sendmsg_locked() call.
>>>>
>>>> Also it looks like the trend is to avoid GFS_NOFS and GFP_NOIO
>>>> and
>>>> use
>>>> memalloc_no{fs,io}_save() instead. So maybe other network file
>>>> systems
>>>> will also stop setting ->sk_allocation in the future and we
>>>> should
>>>> teach sk_page_frag() to look at the current GFP flags. Or should
>>>> we
>>>> stick to ->sk_allocation and make NFS drop __GFP_FS again?
>>>>
>>>> Signed-off-by: Guillaume Nault <gnault@...hat.com>
>>>
>>> Can you provide a Fixes: tag ?
>>>
>>>> ---
>>>>  include/net/sock.h | 8 ++++++--
>>>>  1 file changed, 6 insertions(+), 2 deletions(-)
>>>>
>>>> diff --git a/include/net/sock.h b/include/net/sock.h
>>>> index 72ca97ccb460..b934c9851058 100644
>>>> --- a/include/net/sock.h
>>>> +++ b/include/net/sock.h
>>>> @@ -46,6 +46,7 @@
>>>>  #include <linux/netdevice.h>
>>>>  #include <linux/skbuff.h>      /* struct sk_buff */
>>>>  #include <linux/mm.h>
>>>> +#include <linux/sched/mm.h>
>>>>  #include <linux/security.h>
>>>>  #include <linux/slab.h>
>>>>  #include <linux/uaccess.h>
>>>> @@ -2503,14 +2504,17 @@ static inline void
>>>> sk_stream_moderate_sndbuf(struct sock *sk)
>>>>   * socket operations and end up recursing into sk_page_frag()
>>>>   * while it's already in use: explicitly avoid task page_frag
>>>>   * usage if the caller is potentially doing any of them.
>>>> - * This assumes that page fault handlers use the GFP_NOFS flags.
>>>> + * This assumes that page fault handlers use the GFP_NOFS flags
>>>> + * or run under memalloc_nofs_save() protection.
>>>>   *
>>>>   * Return: a per task page_frag if context allows that,
>>>>   * otherwise a per socket one.
>>>>   */
>>>>  static inline struct page_frag *sk_page_frag(struct sock *sk)
>>>>  {
>>>> -       if ((sk->sk_allocation & (__GFP_DIRECT_RECLAIM |
>>>> __GFP_MEMALLOC | __GFP_FS)) ==
>>>> +       gfp_t gfp_mask = current_gfp_context(sk->sk_allocation);
>>>
>>> This is slowing down TCP sendmsg() fast path, reading current-
>>>> flags,
>>> possibly cold value.
>>
>> True - current->flags is pretty distant from current->task_frag.
>>
>>> I would suggest using one bit in sk, close to sk->sk_allocation to
>>> make the decision,
>>> instead of testing sk->sk_allocation for various flags.
>>>
>>> Not sure if we have available holes.
>>
>> Its looking pretty packed on my build.. the nearest hole is 5
>> cachelines
>> away.
>>
>> It'd be nice to allow network filesystem to use task_frag when
>> possible.
>>
>> If we expect sk_page_frag() to only return task_frag once per call
>> stack,
>> then can we simply check it's already in use, perhaps by looking at
>> the
>> size field?
>>
>> Or maybe can we set sk_allocation early from current_gfp_context()
>> outside
>> the fast path?
>
> Why not just add a bit to sk->sk_allocation itself, and have
> __sock_create() default to setting it when the 'kern' parameter is non-
> zero? NFS is not alone in following the request of the mm team to
> deprecate use of GFP_NOFS and GFP_NOIO.

Can we overload sk_allocation safely?  There's 28 GFP flags already, I'm
worried about unintended consequences if sk_allocation gets passed on.

What about a flag in sk_gso_type?  Looks like there's 13 free there, and its
in the same cacheline as sk_allocation and sk_frag.

Ben

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ