[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAF=yD-L3aXM17=hsJBoauWJ6Dqq16ykcnv8sg-Fn_Td_FsOafA@mail.gmail.com>
Date: Sat, 23 Sep 2023 08:59:25 +0200
From: Willem de Bruijn <willemdebruijn.kernel@...il.com>
To: David Howells <dhowells@...hat.com>
Cc: David Laight <David.Laight@...lab.com>, "David S. Miller" <davem@...emloft.net>,
Eric Dumazet <edumazet@...gle.com>, Jakub Kicinski <kuba@...nel.org>, Paolo Abeni <pabeni@...hat.com>,
Jens Axboe <axboe@...nel.dk>, Al Viro <viro@...iv.linux.org.uk>,
Linus Torvalds <torvalds@...ux-foundation.org>, Christoph Hellwig <hch@....de>,
Christian Brauner <christian@...uner.io>, Matthew Wilcox <willy@...radead.org>,
Jeff Layton <jlayton@...nel.org>, linux-fsdevel@...r.kernel.org,
linux-block@...r.kernel.org, linux-mm@...ck.org, netdev@...r.kernel.org,
linux-kernel@...r.kernel.org
Subject: Re: [PATCH v5 00/11] iov_iter: Convert the iterator macros into
inline funcs
On Fri, Sep 22, 2023 at 2:01 PM David Howells <dhowells@...hat.com> wrote:
>
> David Laight <David.Laight@...LAB.COM> wrote:
>
> > > (8) Move the copy-and-csum code to net/ where it can be in proximity with
> > > the code that uses it. This eliminates the code if CONFIG_NET=n and
> > > allows for the slim possibility of it being inlined.
> > >
> > > (9) Fold memcpy_and_csum() in to its two users.
> > >
> > > (10) Move csum_and_copy_from_iter_full() out of line and merge in
> > > csum_and_copy_from_iter() since the former is the only caller of the
> > > latter.
> >
> > I thought that the real idea behind these was to do the checksum
> > at the same time as the copy to avoid loading the data into the L1
> > data-cache twice - especially for long buffers.
> > I wonder how often there are multiple iov[] that actually make
> > it better than just check summing the linear buffer?
>
> It also reduces the overhead for finding the data to checksum in the case the
> packet gets split since we're doing the checksumming as we copy - but with a
> linear buffer, that's negligible.
>
> > I had a feeling that check summing of udp data was done during
> > copy_to/from_user, but the code can't be the copy-and-csum here
> > for that because it is missing support form odd-length buffers.
>
> Is there a bug there?
>
> > Intel x86 desktop chips can easily checksum at 8 bytes/clock
> > (But probably not with the current code!).
> > (I've got ~12 bytes/clock using adox and adcx but that loop
> > is entirely horrid and it would need run-time patching.
> > Especially since I think some AMD cpu execute them very slowly.)
> >
> > OTOH 'rep movs[bq]' copy will copy 16 bytes/clock (32 if the
> > destination is 32 byte aligned - it pretty much won't be).
> >
> > So you'd need a csum-and-copy loop that did 16 bytes every
> > three clocks to get the same throughput for long buffers.
> > In principle splitting the 'adc memory' into two instructions
> > is the same number of u-ops - but I'm sure I've tried to do
> > that and failed and the extra memory write can happen in
> > parallel with everything else.
> > So I don't think you'll get 16 bytes in two clocks - but you
> > might get it is three.
> >
> > OTOH for a cpu where memcpy is code loop summing the data in
> > the copy loop is likely to be a gain.
> >
> > But I suspect doing the checksum and copy at the same time
> > got 'all to complicated' to actually implement fully.
> > With most modern ethernet chips checksumming receive pacakets
> > does it really get used enough for the additional complexity?
>
> You may be right. That's more a question for the networking folks than for
> me. It's entirely possible that the checksumming code is just not used on
> modern systems these days.
>
> Maybe Willem can comment since he's the UDP maintainer?
Perhaps these days it is more relevant to embedded systems than high
end servers.
Powered by blists - more mailing lists