Message-ID: <2058762.1692371971@warthog.procyon.org.uk>
Date: Fri, 18 Aug 2023 16:19:31 +0100
From: David Howells <dhowells@...hat.com>
To: Linus Torvalds <torvalds@...ux-foundation.org>
Cc: dhowells@...hat.com, David Laight <David.Laight@...lab.com>,
Al Viro <viro@...iv.linux.org.uk>,
Jens Axboe <axboe@...nel.dk>, Christoph Hellwig <hch@...t.de>,
Christian Brauner <christian@...uner.io>,
Matthew Wilcox <willy@...radead.org>,
Jeff Layton <jlayton@...nel.org>,
"linux-fsdevel@...r.kernel.org" <linux-fsdevel@...r.kernel.org>,
"linux-block@...r.kernel.org" <linux-block@...r.kernel.org>,
"linux-mm@...ck.org" <linux-mm@...ck.org>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>
Subject: Re: [PATCH v3 2/2] iov_iter: Don't deal with iter->copy_mc in memcpy_from_iter_mc()
Linus Torvalds <torvalds@...ux-foundation.org> wrote:
> > Although I'm not sure the bit-fields really help.
> > There are 8 bytes at the start of the structure, might as well
> > use them :-)
>
> Actually, I wrote the patch that way because it seems to improve code
> generation.
>
> The bitfields are generally all set together as just plain one-time
> constants at initialization time, and gcc sees that it's a full byte
> write. And the reason 'data_source' is not a bitfield is that it's not
> a constant at iov_iter init time (it's an argument to all the init
> functions), so having that one as a separate byte at init time is good
> for code generation when you don't need to mask bits or anything like
> that.
>
> And once initialized, having things be dense and doing all the
> compares with a bitwise 'and' instead of doing them as some value
> compare again tends to generate good code.
Actually... I said that switch(enum) seemed to generate suboptimal code.
However, if the enum is renumbered such that the constants are in the same
order as the cases in the switch(), it generates better code.
So we want this order:
enum iter_type {
	ITER_UBUF,
	ITER_IOVEC,
	ITER_BVEC,
	ITER_KVEC,
	ITER_XARRAY,
	ITER_DISCARD,
};
to match:
switch (iov_iter_type(iter)) {
case ITER_UBUF:
	progress = iterate_ubuf(iter, len, priv, priv2, ustep);
	break;
case ITER_IOVEC:
	progress = iterate_iovec(iter, len, priv, priv2, ustep);
	break;
case ITER_BVEC:
	progress = iterate_bvec(iter, len, priv, priv2, step);
	break;
case ITER_KVEC:
	progress = iterate_kvec(iter, len, priv, priv2, step);
	break;
case ITER_XARRAY:
	progress = iterate_xarray(iter, len, priv, priv2, step);
	break;
case ITER_DISCARD:
default:
	progress = len;
	break;
}
then gcc should be able to generate a ternary tree, which it does here:
<+77>: mov (%rdx),%al
<+79>: cmp $0x2,%al
<+81>: je 0xffffffff81779bcc <_copy_from_iter+394>
<+87>: ja 0xffffffff81779aa9 <_copy_from_iter+103>
though it splits the number space in the wrong place. It can then use
one CMP (or TEST) to split again:
<+89>: test %al,%al
<+91>: mov 0x8(%rdx),%rdx
<+95>: jne 0xffffffff81779b48 <_copy_from_iter+262>
<+101>: jmp 0xffffffff81779b17 <_copy_from_iter+213>
It should then only require one CMP at this point, since AL can only be 4, 5
or 6+:
<+103>: cmp $0x3,%al
<+105>: je 0xffffffff81779c6e <_copy_from_iter+556>
<+111>: cmp $0x4,%al
<+113>: jne 0xffffffff81779dd2 <_copy_from_iter+912>
The end result is that it should need at most two CMP instructions for any
value in the range 0 to 5 - though in this case it uses three for AL > 3.
However, gcc doesn't do this for an if-if-if chain testing a number sequence
or a bitmask; instead it generates a chain of cmp-cmp-cmp or test-test-test,
presumably because it fails to spot that the if-conditions are related.
I should file a gcc bug about this poor switch() code generation.
Also, if we renumber the enum to put UBUF and IOVEC first, we can get rid of
iter->user_backed in favour of:
static inline bool user_backed_iter(const struct iov_iter *i)
{
	return iter_is_ubuf(i) || iter_is_iovec(i);
}
which gcc just changes into something like a "CMP $1" and a "JA".
Comparing the text sizes from Linus's bitfield patch against renumbering the
enum to match the switch() (+ means the bit patch is smaller, - means
renumbering is smaller):
__iov_iter_get_pages_alloc inc 0x32a -> 0x331 +0x7
_copy_from_iter dcr 0x3c7 -> 0x3bf -0x8
_copy_from_iter_flushcache inc 0x364 -> 0x370 +0xc
_copy_from_iter_nocache inc 0x33e -> 0x347 +0x9
_copy_mc_to_iter dcr 0x3bc -> 0x3b6 -0x6
_copy_to_iter inc 0x34a -> 0x34b +0x1
copy_page_from_iter_atomic.part.0 dcr 0x424 -> 0x41c -0x8
copy_page_to_iter_nofault.part.0 dcr 0x3a9 -> 0x3a5 -0x4
csum_and_copy_from_iter inc 0x3e5 -> 0x3e8 +0x3
csum_and_copy_to_iter dcr 0x449 -> 0x448 -0x1
dup_iter inc 0x34 -> 0x37 +0x3
fault_in_iov_iter_readable dcr 0x9c -> 0x9a -0x2
fault_in_iov_iter_writeable dcr 0x9c -> 0x9a -0x2
iov_iter_advance dcr 0xde -> 0xd8 -0x6
iov_iter_alignment inc 0xe0 -> 0xe3 +0x3
iov_iter_extract_pages inc 0x416 -> 0x418 +0x2
iov_iter_gap_alignment dcr 0x69 -> 0x67 -0x2
iov_iter_init inc 0x27 -> 0x31 +0xa
iov_iter_is_aligned dcr 0x104 -> 0xf5 -0xf
iov_iter_npages inc 0x119 -> 0x11d +0x4
iov_iter_revert dcr 0x93 -> 0x86 -0xd
iov_iter_single_seg_count dcr 0x41 -> 0x3c -0x5
iov_iter_ubuf inc 0x39 -> 0x3a +0x1
iov_iter_zero dcr 0x338 -> 0x32e -0xa
memcpy_from_iter_mc inc 0x2a -> 0x2b +0x1
I think there may be more savings to be made if I go and convert more of the
functions to using switch().
David