Message-ID: <169f6a93856664dd4001840081c82f792ae1dc99.camel@redhat.com>
Date: Fri, 26 Nov 2021 16:05:10 +0100
From: Paolo Abeni <pabeni@...hat.com>
To: Eric Dumazet <edumazet@...gle.com>
Cc: netdev@...r.kernel.org, Steffen Froemer <sfroemer@...hat.com>
Subject: Re: [PATCH net] tcp: fix page frag corruption on page fault
Hello,
On Fri, 2021-11-26 at 06:18 -0800, Eric Dumazet wrote:
> On Fri, Nov 26, 2021 at 4:00 AM Paolo Abeni <pabeni@...hat.com> wrote:
> >
> > Steffen reported a TCP stream corruption for HTTP requests
> > served by the apache web-server using a cifs mount-point
> > and memory mapping the relevant file.
> >
> > The root cause is quite similar to the one addressed by
> > commit 20eb4f29b602 ("net: fix sk_page_frag() recursion from
> > memory reclaim"). Here the nested access to the task page frag
> > is caused by a page fault on the (mmapped) user-space memory
> > buffer coming from the cifs file.
> >
> > The page fault handler performs an SMB transaction on a different
> > socket, inside the same process context. Since sk->sk_allocation
> > for such a socket does not prevent the usage of the task frag,
> > the nested allocation modifies "under the hood" the page frag
> > in use by the outer sendmsg call, corrupting the stream.
> >
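> > For reference, the shared frag is selected by sk_page_frag(); a rough
> > sketch of the mainline helper at this point (details may differ across
> > versions):
> >
> > 	static inline struct page_frag *sk_page_frag(struct sock *sk)
> > 	{
> > 		if (gfpflags_normal_context(sk->sk_allocation))
> > 			return &current->task_frag;
> >
> > 		return &sk->sk_frag;
> > 	}
> >
> > Both the outer TCP socket and the nested cifs transmission pass the
> > gfpflags_normal_context() check, so they end up sharing
> > current->task_frag.
> >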
> > The overall relevant stack trace looks like the following:
> >
> > httpd 78268 [001] 3461630.850950: probe:tcp_sendmsg_locked:
> > ffffffff91461d91 tcp_sendmsg_locked+0x1
> > ffffffff91462b57 tcp_sendmsg+0x27
> > ffffffff9139814e sock_sendmsg+0x3e
> > ffffffffc06dfe1d smb_send_kvec+0x28
> > [...]
> > ffffffffc06cfaf8 cifs_readpages+0x213
> > ffffffff90e83c4b read_pages+0x6b
> > ffffffff90e83f31 __do_page_cache_readahead+0x1c1
> > ffffffff90e79e98 filemap_fault+0x788
> > ffffffff90eb0458 __do_fault+0x38
> > ffffffff90eb5280 do_fault+0x1a0
> > ffffffff90eb7c84 __handle_mm_fault+0x4d4
> > ffffffff90eb8093 handle_mm_fault+0xc3
> > ffffffff90c74f6d __do_page_fault+0x1ed
> > ffffffff90c75277 do_page_fault+0x37
> > ffffffff9160111e page_fault+0x1e
> > ffffffff9109e7b5 copyin+0x25
> > ffffffff9109eb40 _copy_from_iter_full+0xe0
> > ffffffff91462370 tcp_sendmsg_locked+0x5e0
> > ffffffff91462370 tcp_sendmsg_locked+0x5e0
> > ffffffff91462b57 tcp_sendmsg+0x27
> > ffffffff9139815c sock_sendmsg+0x4c
> > ffffffff913981f7 sock_write_iter+0x97
> > ffffffff90f2cc56 do_iter_readv_writev+0x156
> > ffffffff90f2dff0 do_iter_write+0x80
> > ffffffff90f2e1c3 vfs_writev+0xa3
> > ffffffff90f2e27c do_writev+0x5c
> > ffffffff90c042bb do_syscall_64+0x5b
> > ffffffff916000ad entry_SYSCALL_64_after_hwframe+0x65
> >
> > A possible solution would be adding the __GFP_MEMALLOC flag
> > to the cifs allocation. That looks dangerous, as the memory
> > allocated by the cifs fs will not be freed soon and such an
> > allocation will not allow more memory to be freed.
> >
> > Instead, this patch changes the tcp_sendmsg() code to avoid
> > touching the page frag after performing the copy from the
> > user-space buffer. Any page fault or memory reclaim operation
> > there is now free to touch the task page fragment without
> > corrupting the state used by the outer sendmsg().
> >
> > As a downside, if the user-space copy fails, there will be
> > some additional atomic operations due to the reference counting
> > on the faulty fragment, but that looks acceptable for a slow
> > error path.
> >
> > Reported-by: Steffen Froemer <sfroemer@...hat.com>
> > Fixes: 5640f7685831 ("net: use a per task frag allocator")
> > Signed-off-by: Paolo Abeni <pabeni@...hat.com>
> > ---
> > net/ipv4/tcp.c | 20 ++++++++++++--------
> > 1 file changed, 12 insertions(+), 8 deletions(-)
> >
> > diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> > index bbb3d39c69af..2d85636c1577 100644
> > --- a/net/ipv4/tcp.c
> > +++ b/net/ipv4/tcp.c
> > @@ -1304,6 +1304,7 @@ int tcp_sendmsg_locked(struct sock *sk, struct msghdr *msg, size_t size)
> > bool merge = true;
> > int i = skb_shinfo(skb)->nr_frags;
> > struct page_frag *pfrag = sk_page_frag(sk);
> > + unsigned int offset;
> >
> > if (!sk_page_frag_refill(sk, pfrag))
> > goto wait_for_space;
> > @@ -1331,14 +1332,11 @@ int tcp_sendmsg_locked(struct sock *sk, struct msghdr *msg, size_t size)
> > if (!sk_wmem_schedule(sk, copy))
> > goto wait_for_space;
> >
> > - err = skb_copy_to_page_nocache(sk, &msg->msg_iter, skb,
> > - pfrag->page,
> > - pfrag->offset,
> > - copy);
> > - if (err)
> > - goto do_error;
> > -
> > - /* Update the skb. */
> > + /* Update the skb before accessing the user space buffer
> > + * so that we leave the task frag in a consistent state.
> > + * Just in case the page_fault handler needs to use it.
> > + */
> > + offset = pfrag->offset;
> > if (merge) {
> > skb_frag_size_add(&skb_shinfo(skb)->frags[i - 1], copy);
> > } else {
> > @@ -1347,6 +1345,12 @@ int tcp_sendmsg_locked(struct sock *sk, struct msghdr *msg, size_t size)
> > page_ref_inc(pfrag->page);
> > }
> > pfrag->offset += copy;
> > +
> > + err = skb_copy_to_page_nocache(sk, &msg->msg_iter, skb,
> > + pfrag->page,
> > + offset, copy);
> > + if (err)
> > + goto do_error;
> > } else {
> > /* First append to a fragless skb builds initial
> > * pure zerocopy skb
> > --
> > 2.33.1
> >
>
> This patch is completely wrong, you just horribly broke TCP.
Double-checking that I understood correctly: is the problem that, in case
of a skb_copy_to_page_nocache() error, if the skb is not empty, random
data will be introduced into the TCP stream, or is it something
else/more? I obviously did not see that before the submission, nor did my
tests catch it, sorry.
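For my own clarity, the reordered path on copy failure would look like
this (annotated excerpt of the hunk above):

	/* the skb frag and the page frag are updated first ... */
	skb_frag_size_add(&skb_shinfo(skb)->frags[i - 1], copy);
	pfrag->offset += copy;

	/* ... so a later copy failure leaves the frag already grown by
	 * 'copy' bytes that were never actually written
	 */
	err = skb_copy_to_page_nocache(sk, &msg->msg_iter, skb,
				       pfrag->page, offset, copy);
	if (err)
		goto do_error;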
> Please investigate CIFS and gfpflags_normal_context() tandem to fix
> this issue instead.
Do you mean changing the gfpflags_normal_context() definition so that
cifs allocations are excluded? Something like the following should do:
---
diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index b976c4177299..f9286aeeded5 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -379,8 +379,8 @@ static inline bool gfpflags_allow_blocking(const gfp_t gfp_flags)
*/
static inline bool gfpflags_normal_context(const gfp_t gfp_flags)
{
- return (gfp_flags & (__GFP_DIRECT_RECLAIM | __GFP_MEMALLOC)) ==
- __GFP_DIRECT_RECLAIM;
+ return (gfp_flags & (__GFP_DIRECT_RECLAIM | __GFP_MEMALLOC | __GFP_FS)) ==
+ (__GFP_DIRECT_RECLAIM | __GFP_FS);
}
---
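To spell out the intended effect on the networking side (assuming the
usual flag compositions):

	gfpflags_normal_context(GFP_KERNEL)  -> true   (unchanged)
	gfpflags_normal_context(GFP_NOFS)    -> false  (was true)
	gfpflags_normal_context(GFP_ATOMIC)  -> false  (unchanged)

so a socket whose sk_allocation clears __GFP_FS would stop using
current->task_frag in sk_page_frag() and fall back to the per-socket
sk_frag instead.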
If so, there is a caveat: dlm is currently using
gfpflags_normal_context(), apparently to check for a non-blocking
context. If I change the gfpflags_normal_context() definition, I will
likely have to replace gfpflags_normal_context() with
gfpflags_allow_blocking() in dlm.
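For reference, gfpflags_allow_blocking() only tests for direct reclaim,
so it should be a safe drop-in for that kind of check (sketch of the
current definition in include/linux/gfp.h):

	static inline bool gfpflags_allow_blocking(const gfp_t gfp_flags)
	{
		return !!(gfp_flags & __GFP_DIRECT_RECLAIM);
	}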
In that case the relevant patch would touch both the mm and the fs
subsystems; I guess it should go via the fs tree first and then via mm?
Thanks!
Paolo