Message-ID: <fab1dd64-c652-4160-93b4-7b483a8874da@intel.com>
Date: Fri, 31 May 2024 09:12:42 -0700
From: Dave Hansen <dave.hansen@...el.com>
To: Byungchul Park <byungchul@...com>, linux-kernel@...r.kernel.org,
linux-mm@...ck.org
Cc: kernel_team@...ynix.com, akpm@...ux-foundation.org, ying.huang@...el.com,
vernhao@...cent.com, mgorman@...hsingularity.net, hughd@...gle.com,
willy@...radead.org, david@...hat.com, peterz@...radead.org,
luto@...nel.org, tglx@...utronix.de, mingo@...hat.com, bp@...en8.de,
dave.hansen@...ux.intel.com, rjgolo@...il.com
Subject: Re: [PATCH v11 09/12] mm: implement LUF (Lazy Unmap Flush) deferring
	tlb flush when folios get unmapped
On 5/31/24 02:19, Byungchul Park wrote:
..
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 0283cf366c2a..03683bf66031 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -2872,6 +2872,12 @@ static inline void file_end_write(struct file *file)
> if (!S_ISREG(file_inode(file)->i_mode))
> return;
> sb_end_write(file_inode(file)->i_sb);
> +
> + /*
> + * XXX: If needed, can be optimized by avoiding luf_flush() if
> + * the address space of the file has never been involved by luf.
> + */
> + luf_flush();
> }
..
> +void luf_flush(void)
> +{
> + unsigned long flags;
> + unsigned short int ugen;
> +
> + /*
> + * Obtain the latest ugen number.
> + */
> + spin_lock_irqsave(&luf_lock, flags);
> + ugen = luf_gen;
> + spin_unlock_irqrestore(&luf_lock, flags);
> +
> + check_luf_flush(ugen);
> +}
Am I reading this right? There's now an unconditional global spinlock
acquired in the sys_write() path? How can this possibly scale?
So, yeah, I think an optimization is absolutely needed. But, on a more
fundamental level, I just don't believe these patches are being tested.
Even a simple microbenchmark should show a pretty nasty regression on
any decently large system:
> https://github.com/antonblanchard/will-it-scale/blob/master/tests/write1.c
Second, I was just pointing out sys_write() as an example of how the
page cache could change. Couldn't a separate, read/write mmap() of the
file do the same thing and *not* go through sb_end_write()?
So:
	fd = open("foo");
	ptr1 = mmap(fd, PROT_READ);
	ptr2 = mmap(fd, PROT_READ|PROT_WRITE);

	foo = *ptr1;	// populate the page cache
	... page cache page is reclaimed and LUF'd
	*ptr2 = bar;	// new page cache page is allocated and written to

	printk("*ptr1: %d\n", *ptr1);
Doesn't the printk() see stale data?
I think tglx would call all of this "tinkering". The approach to this
series is to "fix" narrow, specific cases that reviewers point out, make
it compile, then send it out again, hoping someone will apply it.
So, for me, until the approach to this series changes: NAK, for x86.
Andrew, please don't take this series. Or, if you do, please drop the
patch enabling it on x86.
I also have the feeling our VFS friends won't take kindly to having
random luf_foo() hooks in their hot paths, optimized or not. I don't
see any of them on cc.