linux-kernel - Re: [mm/contpte v3 1/1] mm/contpte: Optimize loop to reduce redundant operations

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <3d338f91.8c71.1965cd8b1b8.Coremail.xavier_qy@163.com>
Date: Tue, 22 Apr 2025 17:33:47 +0800 (CST)
From: Xavier  <xavier_qy@....com>
To: "Ryan Roberts" <ryan.roberts@....com>, dev.jain@....com,
	ioworker0@...il.com, 21cnbao@...il.com
Cc: akpm@...ux-foundation.org, catalin.marinas@....com, david@...hat.com,
	gshan@...hat.com, linux-arm-kernel@...ts.infradead.org,
	linux-kernel@...r.kernel.org, will@...nel.org, willy@...radead.org,
	ziy@...dia.com
Subject: Re: [mm/contpte v3 1/1] mm/contpte: Optimize loop to reduce
 redundant operations


Hi all,


At 2025-04-16 20:54:47, "Ryan Roberts" <ryan.roberts@....com> wrote:
>On 15/04/2025 09:22, Xavier wrote:
>> This commit optimizes the contpte_ptep_get function by adding early
>>  termination logic. It checks if the dirty and young bits of orig_pte
>>  are already set and skips redundant bit-setting operations during
>>  the loop. This reduces unnecessary iterations and improves performance.
>> 
>> Signed-off-by: Xavier <xavier_qy@....com>
>> ---
>>  arch/arm64/mm/contpte.c | 20 ++++++++++++++++++--
>>  1 file changed, 18 insertions(+), 2 deletions(-)
>> 
>> diff --git a/arch/arm64/mm/contpte.c b/arch/arm64/mm/contpte.c
>> index bcac4f55f9c1..0acfee604947 100644
>> --- a/arch/arm64/mm/contpte.c
>> +++ b/arch/arm64/mm/contpte.c
>> @@ -152,6 +152,16 @@ void __contpte_try_unfold(struct mm_struct *mm, unsigned long addr,
>>  }
>>  EXPORT_SYMBOL_GPL(__contpte_try_unfold);
>>  
>> +/* Note: in order to improve efficiency, using this macro will modify the
>> + * passed-in parameters.*/
>> +#define CHECK_CONTPTE_FLAG(start, ptep, orig_pte, flag) \
>> +    for (; (start) < CONT_PTES; (start)++, (ptep)++) { \
>> +		if (pte_##flag(__ptep_get(ptep))) { \
>> +				orig_pte = pte_mk##flag(orig_pte); \
>> +				break; \
>> +		} \
>> +    }
>
>I'm really not a fan of this macro, it just obfuscates what is going on. I'd
>personally prefer to see the 2 extra loops open coded below.
>
>Or even better, could you provide results comparing this 3 loop version to the
>simpler approach I suggested previously? If the performance is similar (which I
>expect it will be, especially given Barry's point that your test always ensures
>the first PTE is both young and dirty) then I'd prefer to go with the simpler code.
>

Based on the discussions in the previous email, two modifications were adopted
and tested, and the results are as follows:

Modification 1

pte_t contpte_ptep_get(pte_t *ptep, pte_t orig_pte)
{
	pte_t pte;
	int i;

	ptep = contpte_align_down(ptep);

	for (i = 0; i < CONT_PTES; i++, ptep++) {
		pte = __ptep_get(ptep);

		if (pte_dirty(pte)) {
			orig_pte = pte_mkdirty(orig_pte);
			if (pte_young(orig_pte))
				break;
		}

		if (pte_young(pte)) {
			orig_pte = pte_mkyoung(orig_pte);
			if (pte_dirty(orig_pte))
				break;
		}
	}

	return orig_pte;
}

Modification 2

pte_t contpte_ptep_get(pte_t *ptep, pte_t orig_pte)
{
	pte_t pte;
	int i;

	ptep = contpte_align_down(ptep);

	for (i = 0; i < CONT_PTES; i++, ptep++) {
		pte = __ptep_get(ptep);

		if (pte_dirty(pte)) {
			orig_pte = pte_mkdirty(orig_pte);
			for (; i < CONT_PTES; i++, ptep++) {
				pte = __ptep_get(ptep);
				if (pte_young(pte)) {
					orig_pte = pte_mkyoung(orig_pte);
					break;
				}
			}
			break;
		}

		if (pte_young(pte)) {
			orig_pte = pte_mkyoung(orig_pte);
			i++;
			ptep++;
			for (; i < CONT_PTES; i++, ptep++) {
				pte = __ptep_get(ptep);
				if (pte_dirty(pte)) {
					orig_pte = pte_mkdirty(orig_pte);
					break;
				}
			}
			break;
		}
	}

	return orig_pte;
}

Test Code:

#define PAGE_SIZE 4096
#define CONT_PTES 16
#define TEST_SIZE (4096* CONT_PTES * PAGE_SIZE)
#define YOUNG_BIT 8
void rwdata(char *buf)
{
	for (size_t i = 0; i < TEST_SIZE; i += PAGE_SIZE) {
		buf[i] = 'a';
		volatile char c = buf[i];
	}
}
void clear_young_dirty(char *buf)
{
	if (madvise(buf, TEST_SIZE, MADV_FREE) == -1) {
		perror("madvise free failed");
		free(buf);
		exit(EXIT_FAILURE);
	}
	if (madvise(buf, TEST_SIZE, MADV_COLD) == -1) {
		perror("madvise free failed");
		free(buf);
		exit(EXIT_FAILURE);
	}
}
void set_one_young(char *buf)
{
	for (size_t i = 0; i < TEST_SIZE; i += CONT_PTES * PAGE_SIZE) {
		volatile char c = buf[i + YOUNG_BIT * PAGE_SIZE];
	}
}

void test_contpte_perf() {
	char *buf;
	int ret = posix_memalign((void **)&buf, CONT_PTES * PAGE_SIZE, TEST_SIZE);
	if ((ret != 0) || ((unsigned long)buf % CONT_PTES * PAGE_SIZE)) {
		perror("posix_memalign failed");
		exit(EXIT_FAILURE);
	}

	rwdata(buf);
#if TEST_CASE2 || TEST_CASE3
	clear_young_dirty(buf);
#endif
#if TEST_CASE2
	set_one_young(buf);
#endif

	for (int j = 0; j < 500; j++) {
		mlock(buf, TEST_SIZE);

		munlock(buf, TEST_SIZE);
	}
	free(buf);
}
---

Descriptions of three test scenarios

Scenario 1
The data of all 16 PTEs are both dirty and young.
#define TEST_CASE2 0
#define TEST_CASE3 0

Scenario 2
Among the 16 PTEs, only the 8th one is young, and there are no dirty ones.
#define TEST_CASE2 1
#define TEST_CASE3 0

Scenario 3
Among the 16 PTEs, there are neither young nor dirty ones.
#define TEST_CASE2 0
#define TEST_CASE3 1


Test results

|Scenario 1         |       Original|  Modification 1|  Modification 2|
|-------------------|---------------|----------------|----------------|
|instructions       |    37912436160|     18303833386|     18731580031|
|test time          |         4.2797|          2.2687|          2.2949|
|overhead of        |               |                |                |
|contpte_ptep_get() |         21.31%|           4.72%|           4.80%|

|Scenario 2         |       Original|  Modification 1|  Modification 2|
|-------------------|---------------|----------------|----------------|
|instructions       |    36701270862|     38729716276|     36115790086|
|test time          |         3.2335|          3.5732|          3.0874|
|Overhead of        |               |                |                |
|contpte_ptep_get() |         32.26%|          41.35%|          33.57%|

|Scenario 3         |       Original|  Modification 1|  Modification 2|
|-------------------|---------------|----------------|----------------|
|instructions       |    36706279735|     38305241759|     36750881878|
|test time          |         3.2008|          3.5389|          3.1249|
|Overhead of        |               |                |                |
|contpte_ptep_get() |         31.94%|          41.30%|          34.59%|


For Scenario 1, Modification 1 can achieve an instruction count benefit of
51.72% and a time benefit of 46.99%. Modification 2 can achieve an instruction
benefit of 50.59% and a time benefit of 46.38%.

For Scenarios 2, Modification 2 can achieve an instruction count benefit of
1.6% and a time benefit of 4.5%. while Modification 1 significantly increases
the instructions and time due to additional conditional checks.

For Scenario 3, since all the PTEs have neither the young nor the dirty flag,
the branches taken by Modification 1 and Modification 2 should be the same as
those of the original code. In fact, the test results of Modification 2 seem
to be closer to those of the original code. I don't know why there is a
performance regression in Modification 1.

Therefore, I believe modifying the code according to Modification 2 can bring
maximum benefits. Everyone can discuss whether this approach is acceptable,
and if so, I will send Patch V4 to proceed with submitting this modification.


--
Thanks,
Xavier