Message-ID: <20250402233805.464ed70e@pumpkin>
Date: Wed, 2 Apr 2025 23:38:05 +0100
From: David Laight <david.laight.linux@...il.com>
To: Stanislav Fomichev <stfomichev@...il.com>
Cc: Stefan Metzmacher <metze@...ba.org>, Breno Leitao <leitao@...ian.org>,
 Linus Torvalds <torvalds@...ux-foundation.org>, Jens Axboe
 <axboe@...nel.dk>, Pavel Begunkov <asml.silence@...il.com>, Jakub Kicinski
 <kuba@...nel.org>, Christoph Hellwig <hch@....de>, Karsten Keil
 <isdn@...ux-pingi.de>, Ayush Sawal <ayush.sawal@...lsio.com>, Andrew Lunn
 <andrew+netdev@...n.ch>, "David S. Miller" <davem@...emloft.net>, Eric
 Dumazet <edumazet@...gle.com>, Paolo Abeni <pabeni@...hat.com>, Simon
 Horman <horms@...nel.org>, Kuniyuki Iwashima <kuniyu@...zon.com>, Willem de
 Bruijn <willemb@...gle.com>, David Ahern <dsahern@...nel.org>, Marcelo
 Ricardo Leitner <marcelo.leitner@...il.com>, Xin Long
 <lucien.xin@...il.com>, Neal Cardwell <ncardwell@...gle.com>, Joerg Reuter
 <jreuter@...na.de>, Marcel Holtmann <marcel@...tmann.org>, Johan Hedberg
 <johan.hedberg@...il.com>, Luiz Augusto von Dentz <luiz.dentz@...il.com>,
 Oliver Hartkopp <socketcan@...tkopp.net>, Marc Kleine-Budde
 <mkl@...gutronix.de>, Robin van der Gracht <robin@...tonic.nl>, Oleksij
 Rempel <o.rempel@...gutronix.de>, kernel@...gutronix.de, Alexander Aring
 <alex.aring@...il.com>, Stefan Schmidt <stefan@...enfreihafen.org>, Miquel
 Raynal <miquel.raynal@...tlin.com>, Alexandra Winter
 <wintera@...ux.ibm.com>, Thorsten Winkler <twinkler@...ux.ibm.com>, James
 Chapman <jchapman@...alix.com>, Jeremy Kerr <jk@...econstruct.com.au>, Matt
 Johnston <matt@...econstruct.com.au>, Matthieu Baerts <matttbe@...nel.org>,
 Mat Martineau <martineau@...nel.org>, Geliang Tang <geliang@...nel.org>,
 Krzysztof Kozlowski <krzk@...nel.org>, Remi Denis-Courmont
 <courmisch@...il.com>, Allison Henderson <allison.henderson@...cle.com>,
 David Howells <dhowells@...hat.com>, Marc Dionne
 <marc.dionne@...istor.com>, Wenjia Zhang <wenjia@...ux.ibm.com>, Jan
 Karcher <jaka@...ux.ibm.com>, "D. Wythe" <alibuda@...ux.alibaba.com>, Tony
 Lu <tonylu@...ux.alibaba.com>, Wen Gu <guwen@...ux.alibaba.com>, Jon Maloy
 <jmaloy@...hat.com>, Boris Pismenny <borisp@...dia.com>, John Fastabend
 <john.fastabend@...il.com>, Stefano Garzarella <sgarzare@...hat.com>,
 Martin Schiller <ms@....tdt.de>, Björn Töpel
 <bjorn@...nel.org>, Magnus Karlsson <magnus.karlsson@...el.com>, Maciej
 Fijalkowski <maciej.fijalkowski@...el.com>, Jonathan Lemon
 <jonathan.lemon@...il.com>, Alexei Starovoitov <ast@...nel.org>, Daniel
 Borkmann <daniel@...earbox.net>, Jesper Dangaard Brouer <hawk@...nel.org>,
 netdev@...r.kernel.org, linux-kernel@...r.kernel.org,
 linux-sctp@...r.kernel.org, linux-hams@...r.kernel.org,
 linux-bluetooth@...r.kernel.org, linux-can@...r.kernel.org,
 dccp@...r.kernel.org, linux-wpan@...r.kernel.org,
 linux-s390@...r.kernel.org, mptcp@...ts.linux.dev,
 linux-rdma@...r.kernel.org, rds-devel@....oracle.com,
 linux-afs@...ts.infradead.org, tipc-discussion@...ts.sourceforge.net,
 virtualization@...ts.linux.dev, linux-x25@...r.kernel.org,
 bpf@...r.kernel.org, isdn4linux@...tserv.isdn4linux.de,
 io-uring@...r.kernel.org
Subject: Re: [RFC PATCH 0/4] net/io_uring: pass a kernel pointer via
 optlen_t to proto[_ops].getsockopt()

On Wed, 2 Apr 2025 14:21:35 -0700
Stanislav Fomichev <stfomichev@...il.com> wrote:

> On 04/02, David Laight wrote:
> > On Wed, 2 Apr 2025 07:19:46 -0700
> > Stanislav Fomichev <stfomichev@...il.com> wrote:
> >   
> > > On 04/02, David Laight wrote:  
> > > > On Wed, 2 Apr 2025 00:53:58 +0200
> > > > Stefan Metzmacher <metze@...ba.org> wrote:
> > > >     
> > > > > > On 02.04.25 at 00:04, Stanislav Fomichev wrote:
> > > > > > On 04/01, Stefan Metzmacher wrote:      
> > > > > > >> On 01.04.25 at 17:45, Stanislav Fomichev wrote:
> > > > > >>> On 04/01, Breno Leitao wrote:      
> > > > > >>>> On Tue, Apr 01, 2025 at 03:48:58PM +0200, Stefan Metzmacher wrote:      
> > > > > > >>>>> On 01.04.25 at 15:37, Stefan Metzmacher wrote:
> > > > > > >>>>>> On 01.04.25 at 10:19, Stefan Metzmacher wrote:
> > > > > > >>>>>>> On 31.03.25 at 23:04, Stanislav Fomichev wrote:
> > > > > >>>>>>>> On 03/31, Stefan Metzmacher wrote:      
> > > > > >>>>>>>>> The motivation for this is to remove the SOL_SOCKET limitation
> > > > > >>>>>>>>> from io_uring_cmd_getsockopt().
> > > > > >>>>>>>>>
> > > > > >>>>>>>>> The reason for this limitation is that io_uring_cmd_getsockopt()
> > > > > >>>>>>>>> passes a kernel pointer as optlen to do_sock_getsockopt()
> > > > > >>>>>>>>> and can't reach the ops->getsockopt() path.
> > > > > >>>>>>>>>
> > > > > >>>>>>>>> The first idea would be to change the optval and optlen arguments
> > > > > >>>>>>>>> to the protocol specific hooks also to sockptr_t, as that
> > > > > >>>>>>>>> is already used for setsockopt() and also by do_sock_getsockopt()
> > > > > >>>>>>>>> sk_getsockopt() and BPF_CGROUP_RUN_PROG_GETSOCKOPT().
> > > > > >>>>>>>>>
> > > > > > >>>>>>>>> But as Linus doesn't like 'sockptr_t' I used a different approach.
> > > > > >>>>>>>>>
> > > > > >>>>>>>>> @Linus, would that optlen_t approach fit better for you?      
> > > > > >>>>>>>>
> > > > > >>>>>>>> [..]
> > > > > >>>>>>>>      
> > > > > >>>>>>>>> Instead of passing the optlen as user or kernel pointer,
> > > > > >>>>>>>>> we only ever pass a kernel pointer and do the
> > > > > >>>>>>>>> translation from/to userspace in do_sock_getsockopt().      
> > > > > >>>>>>>>
> > > > > >>>>>>>> At this point why not just fully embrace iov_iter? You have the size
> > > > > >>>>>>>> now + the user (or kernel) pointer. Might as well do
> > > > > >>>>>>>> s/sockptr_t/iov_iter/ conversion?      
> > > > > >>>>>>>
> > > > > >>>>>>> I think that would only be possible if we introduce
> > > > > >>>>>>> proto[_ops].getsockopt_iter() and then convert the implementations
> > > > > >>>>>>> step by step. Doing it all in one go has a lot of potential to break
> > > > > >>>>>>> the uapi. I could try to convert things like socket, ip and tcp myself, but
> > > > > >>>>>>> the rest needs to be converted by the maintainer of the specific protocol,
> > > > > >>>>>>> as it needs to be tested. As there are crazy things happening in the existing
> > > > > >>>>>>> implementations, e.g. some getsockopt() implementations use optval as in and out
> > > > > >>>>>>> buffer.
> > > > > >>>>>>>
> > > > > >>>>>>> I first tried to convert both optval and optlen of getsockopt to sockptr_t,
> > > > > >>>>>>> and that showed that touching the optval part starts to get complex very soon,
> > > > > >>>>>>> see https://git.samba.org/?p=metze/linux/wip.git;a=commitdiff;h=141912166473bf8843ec6ace76dc9c6945adafd1
> > > > > > >>>>>>> (note it didn't convert everything; I gave up after hitting
> > > > > >>>>>>> sctp_getsockopt_peer_addrs and sctp_getsockopt_local_addrs.
> > > > > >>>>>>> sctp_getsockopt_context, sctp_getsockopt_maxseg, sctp_getsockopt_associnfo and maybe
> > > > > >>>>>>> more are the ones also doing both copy_from_user and copy_to_user on optval)
> > > > > >>>>>>>
> > > > > > >>>>>>> I also came across one implementation that returned -ERANGE because *optlen was
> > > > > >>>>>>> too short and put the required length into *optlen, which means the returned
> > > > > >>>>>>> *optlen is larger than the optval buffer given from userspace.
> > > > > >>>>>>>
> > > > > >>>>>>> Because of all these strange things I tried to do a minimal change
> > > > > >>>>>>> in order to get rid of the io_uring limitation and only converted
> > > > > >>>>>>> optlen and leave optval as is.
> > > > > >>>>>>>
> > > > > >>>>>>> In order to have a patchset that has a low risk to cause regressions.
> > > > > >>>>>>>
> > > > > > >>>>>>> But as an alternative, we could introduce a prototype like this:
> > > > > >>>>>>>
> > > > > >>>>>>>            int (*getsockopt_iter)(struct socket *sock, int level, int optname,
> > > > > >>>>>>>                                   struct iov_iter *optval_iter);
> > > > > >>>>>>>
> > > > > >>>>>>> That returns a non-negative value which can be placed into *optlen
> > > > > >>>>>>> or negative value as error and *optlen will not be changed on error.
> > > > > >>>>>>> optval_iter will get direction ITER_DEST, so it can only be written to.
> > > > > >>>>>>>
> > > > > > >>>>>>> Implementations could then opt in to the new interface and
> > > > > > >>>>>>> allow do_sock_getsockopt() to work for the io_uring case as well,
> > > > > >>>>>>> while all others would still get -EOPNOTSUPP.
> > > > > >>>>>>>
> > > > > >>>>>>> So what should be the way to go?      
> > > > > >>>>>>
> > > > > >>>>>> Ok, I've added the infrastructure for getsockopt_iter, see below,
> > > > > >>>>>> but the first part I wanted to convert was
> > > > > >>>>>> tcp_ao_copy_mkts_to_user() and that also reads from userspace before
> > > > > >>>>>> writing.
> > > > > >>>>>>
> > > > > >>>>>> So we could go with the optlen_t approach, or we need
> > > > > > >>>>>> logic for ITER_BOTH or pass two iov_iters one with ITER_SOURCE and one
> > > > > >>>>>> with ITER_DEST...
> > > > > >>>>>>
> > > > > >>>>>> So who wants to decide?      
> > > > > >>>>>
> > > > > > >>>>> I just noticed that it's even possible in some cases
> > > > > >>>>> to pass in a short buffer to optval, but have a longer value in optlen,
> > > > > >>>>> hci_sock_getsockopt() with SOL_BLUETOOTH completely ignores optlen.
> > > > > >>>>>
> > > > > >>>>> This makes it really hard to believe that trying to use iov_iter for this
> > > > > >>>>> is a good idea :-(      
> > > > > >>>>
> > > > > >>>> That was my finding as well a while ago, when I was planning to get the
> > > > > >>>> __user pointers converted to iov_iter. There are some weird ways of
> > > > > > >>>> using optlen and optval, which makes them non-trivial to convert to
> > > > > >>>> iov_iter.      
> > > > > >>>
> > > > > >>> Can we ignore all non-ip/tcp/udp cases for now? This should cover +90%
> > > > > >>> of useful socket opts. See if there are any obvious problems with them
> > > > > >>> and if not, try converting. The rest we can cover separately when/if
> > > > > >>> needed.      
> > > > > >>
> > > > > >> That's what I tried, but it fails with
> > > > > >> tcp_getsockopt ->
> > > > > >>     do_tcp_getsockopt ->
> > > > > >>       tcp_ao_get_mkts ->
> > > > > >>          tcp_ao_copy_mkts_to_user ->
> > > > > >>             copy_struct_from_sockptr
> > > > > >>       tcp_ao_get_sock_info ->
> > > > > >>          copy_struct_from_sockptr
> > > > > >>
> > > > > > >> That's not possible with an ITER_DEST iov_iter.
> > > > > >>
> > > > > >> metze      
> > > > > > 
> > > > > > Can we create two iterators over the same memory? One for ITER_SOURCE and
> > > > > > another for ITER_DEST. And then make getsockopt_iter accept optval_in and
> > > > > > optval_out. We can also use optval_out position (iov_offset) as optlen output
> > > > > > value. Don't see why it won't work, but I agree that's gonna be a messy
> > > > > > conversion so let's see if someone else has better suggestions.      
> > > > > 
> > > > > Yes, that might work, but it would be good to get some feedback
> > > > > if this would be the way to go:
> > > > > 
> > > > >            int (*getsockopt_iter)(struct socket *sock,
> > > > > 				 int level, int optname,
> > > > > 				 struct iov_iter *optval_in,
> > > > > 				 struct iov_iter *optval_out);
> > > > > 
> > > > > And *optlen = optval_out->iov_offset;
> > > > > 
> > > > > Any objection or better ideas? Linus would that be what you had in mind?    
> > > > 
> > > > I'd worry about performance - yes I know 'iter' are used elsewhere but...
> > > > Also look at the SCTP code.    
> > > 
> > > Performance usually does not matter for set/getsockopts, there
> > > are a few exceptions that I know (TCP_ZEROCOPY_RECEIVE)  
> > 
> > That might be the one that is really horrid and completely abuses
> > the 'length' parameter.  
> 
> It is reading and writing, yes, but it's not a huge problem. And it
> does enforce the optlen (to copy back the same amount of bytes). It's
> not that bad, it's just an example of where we need to be extra
> careful.
> 
> > > and maybe recent
> > > devmem sockopts; we can special-case these if needed, or keep sockptr_t,
> > > idk. I'm skeptical we can convert everything though, that's why the
> > > suggestion to start with sk/ip/tcp/udp.
> > >   
> > > > How do you handle code that wants to return an updated length (often longer
> > > > than the one provided) and an error code (eg ERRSIZE or similar).
> > > >
> > > > There is also a very strange use (I think it is a sockopt rather than an ioctl)
> > > > where the buffer length the application provides is only that of the header.
> > > > The actual buffer length is contained in the header.
> > > > The return length is the amount written into the full buffer.    
> > > 
> > > Let's discuss these special cases as they come up? Worst case these
> > > places can always re-init iov_iter with a comment on why it is ok.
> > > But I do agree in general that there are a few places that do wild
> > > stuff.  
> > 
> > The problem is that the generic code has to deal with all the 'wild stuff'.  
> 
> getsockopt_iter will have optval_in for the minority of socket options
> (like TCP_ZEROCOPY_RECEIVE) that want to read user's value as well
> as optval_out. The latter is what the majority of socket options
> will use to write their value. That doesn't seem too complicated to
> handle?
> 
> > It is also common to do non-sequential accesses - so iov_iter doesn't match
> > at all.  
> 
> I disagree that it's 'common'. Searching for copy_from_sockptr_offset
> returns a few cases and they are mostly using read-with-offset because
> there is no sequential read (iterator) semantics with sockptr_t.
> 
> > There also isn't a requirement for scatter-gather.
> > 
> > For 'normal' getsockopt (and setsockopt) with short lengths it actually makes
> > sense for the syscall wrapper to do the user copies.
> > But it would need to pass the user ptr+len as well as the kernel ptr+len
> > to give the required flexibility.
> > Then you have to work out whether the final copy to user is needed or not.
> > (not that hard, but it all adds complication).  
> 
> Not sure I understand what's the problem. The user vs kernel part will
> be abstracted by iov_iter. The callers will have to write the optlen
> back. And there are two call sites we care about: io_uring and regular
> system call. What's your suggestion? Maybe I'm missing something. Do you
> prefer get_optlen/put_optlen?

I think the final aim should be to pass the user supplied length to the
per-protocol code and have it return the length/error to be passed back to the
user.

But in a lot of cases the syscall wrapper can do the buffer copies (as well
as the length copies).
That would be restricted to short lengths (an on-stack buffer).
So code that needed a long buffer (like some of the sctp options)
would need to directly access the user buffer (or a long buffer provided
by an in-kernel user).
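
As a rough illustration of the split described above, here is a minimal
userspace sketch, not kernel code. Everything in it is an assumption made
for illustration: SHORT_OPT_MAX, getopt_fn, toy_getsockopt() and
wrapper_getsockopt() are hypothetical names, and memcpy() merely stands in
for copy_from_user()/copy_to_user(). The long-buffer path that would need
direct access to the caller's buffer is not modelled.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>

#define SHORT_OPT_MAX 64	/* hypothetical limit for the on-stack buffer */

/*
 * Hypothetical per-protocol hook: receives a kernel buffer and the
 * caller-supplied length, returns the number of bytes produced or a
 * negative errno.
 */
typedef int (*getopt_fn)(int optname, void *kbuf, int len);

/* Toy handler: every option is an int whose value is optname * 2. */
static int toy_getsockopt(int optname, void *kbuf, int len)
{
	int val = optname * 2;

	if (len < (int)sizeof(val))
		return -EINVAL;
	memcpy(kbuf, &val, sizeof(val));
	return (int)sizeof(val);
}

/*
 * Model of the syscall wrapper: it does the length and buffer copies
 * itself for short options. memcpy() stands in for the user copies.
 */
static int wrapper_getsockopt(getopt_fn fn, int optname,
			      void *uval, int *ulen)
{
	unsigned char kbuf[SHORT_OPT_MAX];
	int len, ret;

	memcpy(&len, ulen, sizeof(len));	/* read the caller's length */
	if (len < 0)
		return -EINVAL;
	if (len > SHORT_OPT_MAX)
		return -EOPNOTSUPP;	/* long buffers: not modelled here */

	memset(kbuf, 0, sizeof(kbuf));
	ret = fn(optname, kbuf, len);
	if (ret < 0)
		return ret;		/* length is left untouched on error */
	if (ret > len)
		return -EINVAL;		/* handler claimed more than it was given */

	memcpy(uval, kbuf, (size_t)ret);	/* copy the value back */
	memcpy(ulen, &ret, sizeof(ret));	/* copy the length back */
	return 0;
}
```

In this model the handler never touches the caller's memory, so the same
handler could serve both a regular system call and an in-kernel caller.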

But you'll find code that reads/writes well beyond the apparent size of
the user buffer.
(And not just code that accesses 4 bytes without checking the length).

	David



