Message-ID: <CAGudoHGr6y-WQY9CZ7mppeX87cgN0dG07ivK+MaoUow3ymArDw@mail.gmail.com>
Date: Thu, 20 Mar 2025 15:35:51 +0100
From: Mateusz Guzik <mjguzik@...il.com>
To: "Herton R. Krzesinski" <herton@...hat.com>
Cc: x86@...nel.org, tglx@...utronix.de, mingo@...hat.com, bp@...en8.de,
dave.hansen@...ux.intel.com, hpa@...or.com, linux-kernel@...r.kernel.org,
torvalds@...ux-foundation.org, olichtne@...hat.com, atomasov@...hat.com,
aokuliar@...hat.com
Subject: Re: [PATCH] x86: write aligned to 8 bytes in copy_user_generic (when
without FSRM/ERMS)
On Thu, Mar 20, 2025 at 3:22 PM Herton R. Krzesinski <herton@...hat.com> wrote:
> diff --git a/arch/x86/lib/copy_user_64.S b/arch/x86/lib/copy_user_64.S
> index fc9fb5d06174..b8f74d80f35c 100644
> --- a/arch/x86/lib/copy_user_64.S
> +++ b/arch/x86/lib/copy_user_64.S
> @@ -74,6 +74,24 @@ SYM_FUNC_START(rep_movs_alternative)
> _ASM_EXTABLE_UA( 0b, 1b)
>
> .Llarge_movsq:
> + /* Do the first possibly unaligned word */
> +0: movq (%rsi),%rax
> +1: movq %rax,(%rdi)
> +
> + _ASM_EXTABLE_UA( 0b, .Lcopy_user_tail)
> + _ASM_EXTABLE_UA( 1b, .Lcopy_user_tail)
> +
> + /* What would be the offset to the aligned destination? */
> + leaq 8(%rdi),%rax
> + andq $-8,%rax
> + subq %rdi,%rax
> +
> + /* .. and update pointers and count to match */
> + addq %rax,%rdi
> + addq %rax,%rsi
> + subq %rax,%rcx
> +
> + /* make %rcx contain the number of words, %rax the remainder */
> movq %rcx,%rax
> shrq $3,%rcx
> andl $7,%eax
The patch looks fine to me, but there is more to do if you are up for it.
It has been quite some time since I last seriously played with this
area and I don't remember all the details; on top of that, uarchs have
probably improved since then.
That said, have you experimented with aligning the target to 16 or
more bytes?
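To make the question concrete, here is a rough C sketch of the same
offset arithmetic the patch does with leaq/andq/subq, with the
alignment as a parameter. head_bytes() is a made-up name for
illustration, not anything in the kernel, and the code only models the
pointer math, not the fault handling.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper, not kernel code: how many bytes to advance so
 * that dst + result is aligned to `align` (which must be a power of
 * two).  This mirrors the patch's sequence
 *     leaq 8(%rdi),%rax ; andq $-8,%rax ; subq %rdi,%rax
 * so an already aligned dst yields `align`, not 0.
 */
size_t head_bytes(uintptr_t dst, size_t align)
{
	assert(align != 0 && (align & (align - 1)) == 0);
	return (size_t)(((dst + align) & ~(uintptr_t)(align - 1)) - dst);
}
```

With align == 16 the result can be anywhere from 1 to 16, so the
unaligned head copy would have to cover a full 16 bytes (for example
two 8-byte moves) before the pointers and count are adjusted.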
Moreover, I have some recollection that there were uarchs with ERMS
which also liked the target to be aligned -- as in perhaps this should
be done regardless of FSRM?
And most importantly, memset, memcpy and clear_user could all use a
revamp: they are missing rep handling for bigger sizes (I verified such
sizes *do* show up). Not only that, but memcpy uses overlapping stores
while memset just loops over stuff.
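For anyone not familiar with the overlapping stores trick, a minimal C
sketch of the general idea follows. copy_8_to_16() is a hypothetical
name and this is not the kernel's memcpy, just an illustration of the
technique for one size class.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical illustration: copy n bytes, 8 <= n <= 16, using two
 * 8-byte loads and two 8-byte stores.  The second access is anchored
 * at the end of the buffer, so for n < 16 it overlaps the first one
 * instead of needing a byte-by-byte tail loop.
 */
void copy_8_to_16(void *dst, const void *src, size_t n)
{
	uint64_t head, tail;

	assert(n >= 8 && n <= 16);

	/* Load both words before storing anything. */
	memcpy(&head, src, 8);
	memcpy(&tail, (const unsigned char *)src + n - 8, 8);

	/* Store the first 8 bytes, then the last 8 bytes. */
	memcpy(dst, &head, 8);
	memcpy((unsigned char *)dst + n - 8, &tail, 8);
}
```

The fixed-size memcpy() calls are only there to express unaligned
8-byte accesses in portable C; compilers typically turn each into a
single move.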
I intended to sort it out a long time ago and maybe will find some time
now that I got reminded of it, but I would be delighted if it got
picked up.
Hacking this up is just some screwing around; the real time-consuming
part is the benchmarking, so I completely understand if you are not
interested.
--
Mateusz Guzik <mjguzik gmail.com>