lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <009601da1945$2ff0d0c0$8fd27240$@wangsu.com>
Date: Fri, 17 Nov 2023 18:59:50 +0800
From: "Pengcheng Yang" <yangpc@...gsu.com>
To: "'John Fastabend'" <john.fastabend@...il.com>,
	"'Jakub Sitnicki'" <jakub@...udflare.com>,
	"'Eric Dumazet'" <edumazet@...gle.com>,
	"'Jakub Kicinski'" <kuba@...nel.org>,
	<bpf@...r.kernel.org>,
	<netdev@...r.kernel.org>
Subject: Re: [PATCH bpf-next 2/3] tcp: Add the data length in skmsg to SIOCINQ ioctl

John Fastabend <john.fastabend@...il.com> wrote:
> Pengcheng Yang wrote:
> > John Fastabend <john.fastabend@...il.com> wrote:
> > > Pengcheng Yang wrote:
> > > > SIOCINQ ioctl returns the number unread bytes of the receive
> > > > queue but does not include the ingress_msg queue. With the
> > > > sk_msg redirect, an application may get a value 0 if it calls
> > > > SIOCINQ ioctl before recv() to determine the readable size.
> > > >
> > > > Signed-off-by: Pengcheng Yang <yangpc@...gsu.com>
> > >
> > > This will break the SK_PASS case I believe. Here we do
> > > not update copied_seq until data is actually copied into user
> > > space. This also ensures tcp_epollin_ready works correctly and
> > > tcp_inq. The fix is relatively recent.
> > >
> > >  commit e5c6de5fa025882babf89cecbed80acf49b987fa
> > >  Author: John Fastabend <john.fastabend@...il.com>
> > >  Date:   Mon May 22 19:56:12 2023 -0700
> > >
> > >     bpf, sockmap: Incorrectly handling copied_seq
> > >
> > > The previous patch increments the msg_len for all cases even
> > > the SK_PASS case so you will get double counting.
> >
> > You are right, I missed the SK_PASS case of skb stream verdict.
> >
> > >
> > > I was starting to poke around at how to fix the other cases e.g.
> > > stream parser is in use and redirects but haven't got to it  yet.
> > > By the way I think even with this patch epollin_ready is likely
> > > not correct still. We observe this as either failing to wake up
> > > or waking up an application to early when using stream parser.
> > >
> > > The other thing to consider is redirected skb into another socket
> > > and then read off the list increment the copied_seq even though
> > > they shouldn't if they came from another sock?  The result would
> > > be tcp_inq would be incorrect even negative perhaps?
> > >
> > > What does your test setup look like? Simple redirect between
> > > two TCP sockets? With or without stream parser? My guess is we
> > > need to fix underlying copied_seq issues related to the redirect
> > > and stream parser case. I believe the fix is, only increment
> > > copied_seq for data that was put on the ingress_queue from SK_PASS.
> > > Then update previous patch to only incrmeent sk_msg_queue_len()
> > > for redirect paths. And this patch plus fix to tcp_epollin_ready
> > > would resolve most the issues. Its a bit unfortunate to leak the
> > > sk_sg_queue_len() into tcp_ioctl and tcp_epollin but I don't have
> > > a cleaner idea right now.
> > >
> >
> > What I tested was to use msg_verdict to redirect between two sockets
> > without stream parser, and the problem I encountered is that msg has
> > been queued in psock->ingress_msg, and the application has been woken up
> > by epoll (because of sk_psock_data_ready), but the ioctl(FIONREAD) returns 0.
> 
> Yep makes sense.
> 
> >
> > The key is that the rcv_nxt is not updated on ingress redirect, or we only need
> > to update rcv_nxt on ingress redirect, such as in bpf_tcp_ingress() and
> > sk_psock_skb_ingress_enqueue() ?
> >
> 
> I think its likely best not to touch rcv_nxt. 'rcv_nxt' is used in
> the tcp stack to calculate lots of things. If you just bump it and
> then ever received an actual TCP pkt you would get some really
> odd behavior because seq numbers and rcv_nxt would be unrelated then.
> 
> The approach you have is really the best bet IMO, but mask out
> the increment msg_len where its not needed. Then it should be OK.
> 

I think we can add a flag to msg to identify whether msg comes from the same
sock's receive_queue. In this way, we can increase and decrease the msg_len
based on this flag when msg is queued to ingress_msg and when it is read by
the application.

And, this can also fix the case you mentioned above:

	"The other thing to consider is redirected skb into another socket
	and then read off the list increment the copied_seq even though
	they shouldn't if they came from another sock?  The result would
	be tcp_inq would be incorrect even negative perhaps?"

During recv in tcp_bpf_recvmsg_parser(), we only need to increment copied_seq
when the msg comes from the same sock's receive_queue, otherwise copied_seq
may overflow rcv_nxt in this case.

> Mixing ingress redirect and TCP sending/recv pkts doesn't usually work
> very well anyway but I still think leaving rcv_nxt alone is best.


Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ