lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <CAMArcTUXm13xJO9XqcT=0uQAn_ZQOQ=Y49EPpHqV+jkkhihMcw@mail.gmail.com>
Date: Sun, 18 Aug 2024 01:51:09 +0900
From: Taehee Yoo <ap420073@...il.com>
To: Mina Almasry <almasrymina@...gle.com>
Cc: netdev@...r.kernel.org, linux-kernel@...r.kernel.org, 
	linux-doc@...r.kernel.org, linux-alpha@...r.kernel.org, 
	linux-mips@...r.kernel.org, linux-parisc@...r.kernel.org, 
	sparclinux@...r.kernel.org, linux-trace-kernel@...r.kernel.org, 
	linux-arch@...r.kernel.org, linux-kselftest@...r.kernel.org, 
	bpf@...r.kernel.org, linux-media@...r.kernel.org, 
	dri-devel@...ts.freedesktop.org, "David S. Miller" <davem@...emloft.net>, 
	Eric Dumazet <edumazet@...gle.com>, Jakub Kicinski <kuba@...nel.org>, Paolo Abeni <pabeni@...hat.com>, 
	Donald Hunter <donald.hunter@...il.com>, Jonathan Corbet <corbet@....net>, 
	Richard Henderson <richard.henderson@...aro.org>, Ivan Kokshaysky <ink@...assic.park.msu.ru>, 
	Matt Turner <mattst88@...il.com>, Thomas Bogendoerfer <tsbogend@...ha.franken.de>, 
	"James E.J. Bottomley" <James.Bottomley@...senpartnership.com>, Helge Deller <deller@....de>, 
	Andreas Larsson <andreas@...sler.com>, Jesper Dangaard Brouer <hawk@...nel.org>, 
	Ilias Apalodimas <ilias.apalodimas@...aro.org>, Steven Rostedt <rostedt@...dmis.org>, 
	Masami Hiramatsu <mhiramat@...nel.org>, Mathieu Desnoyers <mathieu.desnoyers@...icios.com>, 
	Arnd Bergmann <arnd@...db.de>, Steffen Klassert <steffen.klassert@...unet.com>, 
	Herbert Xu <herbert@...dor.apana.org.au>, David Ahern <dsahern@...nel.org>, 
	Willem de Bruijn <willemdebruijn.kernel@...il.com>, Shuah Khan <shuah@...nel.org>, 
	Alexei Starovoitov <ast@...nel.org>, Daniel Borkmann <daniel@...earbox.net>, 
	John Fastabend <john.fastabend@...il.com>, Sumit Semwal <sumit.semwal@...aro.org>, 
	Christian König <christian.koenig@....com>, 
	Bagas Sanjaya <bagasdotme@...il.com>, Christoph Hellwig <hch@...radead.org>, 
	Nikolay Aleksandrov <razor@...ckwall.org>, Pavel Begunkov <asml.silence@...il.com>, David Wei <dw@...idwei.uk>, 
	Jason Gunthorpe <jgg@...pe.ca>, Yunsheng Lin <linyunsheng@...wei.com>, 
	Shailend Chand <shailend@...gle.com>, Harshitha Ramamurthy <hramamurthy@...gle.com>, 
	Shakeel Butt <shakeel.butt@...ux.dev>, Jeroen de Borst <jeroendb@...gle.com>, 
	Praveen Kaligineedi <pkaligineedi@...gle.com>, Willem de Bruijn <willemb@...gle.com>, 
	Kaiyuan Zhang <kaiyuanz@...gle.com>
Subject: Re: [PATCH net-next v19 09/13] tcp: RX path for devmem TCP

On Sun, Aug 18, 2024 at 12:13 AM Mina Almasry <almasrymina@...gle.com> wrote:
>
> On Sat, Aug 17, 2024 at 9:58 AM Taehee Yoo <ap420073@...il.com> wrote:
> >
> > On Wed, Aug 14, 2024 at 6:13 AM Mina Almasry <almasrymina@...gle.com> wrote:
> > >
> >
> > Hi Mina,
> >
> > > In tcp_recvmsg_locked(), detect if the skb being received by the user
> > > is a devmem skb. In this case - if the user provided the MSG_SOCK_DEVMEM
> > > flag - pass it to tcp_recvmsg_devmem() for custom handling.
> > >
> > > tcp_recvmsg_devmem() copies any data in the skb header to the linear
> > > buffer, and returns a cmsg to the user indicating the number of bytes
> > > returned in the linear buffer.
> > >
> > > tcp_recvmsg_devmem() then loops over the unaccessible devmem skb frags,
> > > and returns to the user a cmsg_devmem indicating the location of the
> > > data in the dmabuf device memory. cmsg_devmem contains this information:
> > >
> > > 1. the offset into the dmabuf where the payload starts. 'frag_offset'.
> >
> > I have been testing this patch and I found a bug.
>
> Thanks Taehee. It's exciting to see that you have gotten this far in
> your testing!! You seem to have devmem TCP (almost) fully working!!

Yes, Finally!!

> May I ask which driver this is? I assume it's bnxt. Do you have the
> driver support somewhere on github or something? I'm curious what your
> driver implementation looks like.

I'm using a modified bnxt_en driver.
NICs are BCM57412, BCM57508(currently only 57412).
I modified the driver by myself for devmem TCP.
The implementation is too rough, only for testing it.

>
> > While testing it with the ncdevmem cmd, it fails to validate buffers
> > after some period.
> > This is because tcp_recvmsg_dmabuf() can't handle skb properly when
> > the parameter offset != 0.
>
> Sadly I'm unable to reproduce this issue, but I think I know where to
> suspect the bug is. Thanks for taking the time to root cause this and
> provide a fix.

I'm not sure why, but I assume this bug appears when CPU utilization
peeks to almost 100%.

>
> ...
>
> > > + offset = 0;
> >
> > If the offset is 5000 and only 4500 bytes are skipped at this point,
> > the offset should be 500, not 0.
> > We need to add a condition to set the offset correctly.
> >
>
> I highly suspect this is a regression that was introduced in v13. In
> v12 Pavel asked if offset can just be set to 0 here, and I didn't see
> any reason why not, so I made the change:
>
> -+ offset = offset - start;
> ++ offset = 0;
>
> It looks like we missed something. I suspect reverting that may
> resolve the issue, because __skb_copy_datagram() in earlier kernels
> modified offset like this and it's well tested. Can you test with this
> change reverted? Diff like so:

No problem, I will test it.

>
> diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> index 40e7335dae6e..984e28c5d096 100644
> --- a/net/ipv4/tcp.c
> +++ b/net/ipv4/tcp.c
> @@ -2522,7 +2522,7 @@ static int tcp_recvmsg_dmabuf(struct sock *sk,
> const struct sk_buff *skb,
> */
> skb = skb_shinfo(skb)->frag_list ?: skb->next;
>
> - offset = 0;
> + offset = offset - start;
> } while (skb);
>
> if (remaining_len) {
>
> I'm running a long test to try to reproduce this issue, but I have ran
> many long tests before and was not able to. For some reason my setup
> is not able to reproduce this edge case. Are you doing anything
> special with ncdevmem? Or simply running commands like these on the
> server client?
>
> server: ./ncdevmem -s SERVER -c CLIENT -l -p 5224 -v 7
> client: yes $(echo -e \\x01\\x02\\x03\\x04\\x05\\x06) | tr \\n \\0 |
> nc SERVER 5224 -p 5224
>

Yes, I'm using the vanilla ncdevmem command and option in the
documentation in this series.

server:
./ncdevmem -s 192.168.1.4 -c 192.168.1.2 -f $interface -l -p 5000 -v 7 -t 0 -q 4

client:
yes $(echo -e \\x01\\x02\\x03\\x04\\x05\\x06) | \
tr \\n \\0 | \
head -c 1000000G | \
nc 192.168.1.4 5000 -p 5000


BTW, this is a ncdevmem failure message of the current implementation.
(offset = 0)

received frag_page=13205, in_page_offset=0, frag_offset=54087680,
frag_size=2896, token=288, total_received=1153657976, dmabuf_id=2
Validated buffer
received frag_page=13204, in_page_offset=0, frag_offset=54083584,
frag_size=1448, token=289, total_received=1153659424, dmabuf_id=2
Validated buffer
[1] received frag_page=13203, in_page_offset=0, frag_offset=54079488,
frag_size=2896, token=290, total_received=1153662320, dmabuf_id=2
Validated buffer
received frag_page=13202, in_page_offset=0, frag_offset=54075392,
frag_size=200, token=291, total_received=1153662520, dmabuf_id=2
Validated buffer
total_received=1153662520


recvmsg ret=819200
[2] received frag_page=13203, in_page_offset=0, frag_offset=54079488,
frag_size=2896, token=1, total_received=1153665416, dmabuf_id=2
Failed validation: expected=4, actual=2, index=0
Failed validation: expected=5, actual=3, index=1
Failed validation: expected=6, actual=4, index=2
Failed validation: expected=0, actual=5, index=3
Failed validation: expected=1, actual=6, index=4
Failed validation: expected=2, actual=0, index=5
Failed validation: expected=3, actual=1, index=6
Failed validation: expected=4, actual=2, index=7
Failed validation: expected=5, actual=3, index=8
Failed validation: expected=6, actual=4, index=9
Failed validation: expected=0, actual=5, index=10
Failed validation: expected=1, actual=6, index=11
Failed validation: expected=2, actual=0, index=12
Failed validation: expected=3, actual=1, index=13
Failed validation: expected=4, actual=2, index=14
Failed validation: expected=5, actual=3, index=15
Failed validation: expected=6, actual=4, index=16
Failed validation: expected=0, actual=5, index=17
Failed validation: expected=1, actual=6, index=18
Failed validation: expected=2, actual=0, index=19
Failed validation: expected=3, actual=1, index=20
./ncdevmem: validation failed.

Please look at the [1] and [2].
At the [1], The 13203 page is fully passed to userspace.
But 13202 is not, only 200 bytes are passed to userspace.
The 13203 page is passed to userspace fully, but 13202 is not.
Only 200 bytes are passed to userspace.

But at the [2], it receives 2896 bytes from 13203 again.
It should be 13202 and in_page_offset=200.

And I just started testing your suggestion, it seems to work correctly.
Here is the ncdevmem message, it's not a failure message.
(offset = offset - start)

received frag_page=13085, in_page_offset=0, frag_offset=53596160,
frag_size=2896, token=288, total_received=2233699704, dmabuf_id=2
Validated buffer
received frag_page=12931, in_page_offset=0, frag_offset=52965376,
frag_size=2896, token=289, total_received=2233702600, dmabuf_id=2
Validated buffer
[1] received frag_page=12916, in_page_offset=0, frag_offset=52903936,
frag_size=1392, token=290, total_received=2233703992, dmabuf_id=2
Validated buffer
total_received=2233703992


recvmsg ret=819200
[2] received frag_page=12916, in_page_offset=1392,
frag_offset=52905328, frag_size=1504, token=1,
total_received=2233705496,
dmabuf_id=2
Validated buffer
received frag_page=13244, in_page_offset=0, frag_offset=54247424,
frag_size=2896, token=2, total_received=2233708392, dmabuf_id=2
Validated buffer
received frag_page=13579, in_page_offset=0, frag_offset=55619584,
frag_size=1448, token=3, total_received=2233709840, dmabuf_id=2
Validated buffer
received frag_page=12315, in_page_offset=0, frag_offset=50442240,
frag_size=2896, token=4, total_received=2233712736, dmabuf_id=2
Validated buffer

At the [1], the 12916 page was not passed to userspace fully.
Only 1392 bytes are received.
At the [2], remain 1504 bytes are passed and the offset is 1392.
So, the 12916 page size is 2896, so it makes sense.

So, this is the reason why I think your suggestion is working correctly.
I have been still testing it, so I will report if it fails while testing.
But I think it works well about this corner case so far.

>
> --
> Thanks,
> Mina

Thanks a lot,
Taehee Yoo

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ