linux-kernel - Re: [PATCH] swap: Add percpu cluster_next to reduce lock contention on swap cache

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <87zha5kauc.fsf@yhuang-dev.intel.com>
Date:   Mon, 18 May 2020 14:37:15 +0800
From:   "Huang\, Ying" <ying.huang@...el.com>
To:     Daniel Jordan <daniel.m.jordan@...cle.com>
Cc:     Andrew Morton <akpm@...ux-foundation.org>, <linux-mm@...ck.org>,
        <linux-kernel@...r.kernel.org>, Michal Hocko <mhocko@...e.com>,
        Minchan Kim <minchan@...nel.org>,
        Tim Chen <tim.c.chen@...ux.intel.com>,
        Hugh Dickins <hughd@...gle.com>
Subject: Re: [PATCH] swap: Add percpu cluster_next to reduce lock contention on swap cache

Daniel Jordan <daniel.m.jordan@...cle.com> writes:

> On Thu, May 14, 2020 at 03:04:24PM +0800, Huang Ying wrote:
>> And the pmbench score increases 15.9%.
>
> What metric is that, and how long did you run the benchmark for?

I run the benchmark for 1800s.  The metric comes from the following
output of the pmbench,

[1] Benchmark done - took 1800.088 sec for 122910000 page access

That is, the throughput is 122910000 / 1800.088 = 68280.0 (accesses/s).
Then we sum the values from the different processes.

> Given that this thing is probabilistic, did you notice much variance from run
> to run?

The results looks quite stable for me.  The standard deviation of
results run to run is less than 1% for me.

>> diff --git a/mm/swapfile.c b/mm/swapfile.c
>> index 35be7a7271f4..9f1343b066c1 100644
>> --- a/mm/swapfile.c
>> +++ b/mm/swapfile.c
>> @@ -746,7 +746,16 @@ static int scan_swap_map_slots(struct swap_info_struct *si,
>>  	 */
>>  
>>  	si->flags += SWP_SCANNING;
>> -	scan_base = offset = si->cluster_next;
>> +	/*
>> +	 * Use percpu scan base for SSD to reduce lock contention on
>> +	 * cluster and swap cache.  For HDD, sequential access is more
>> +	 * important.
>> +	 */
>> +	if (si->flags & SWP_SOLIDSTATE)
>> +		scan_base = this_cpu_read(*si->cluster_next_cpu);
>> +	else
>> +		scan_base = si->cluster_next;
>> +	offset = scan_base;
>>  
>>  	/* SSD algorithm */
>>  	if (si->cluster_info) {
>
> It's just a nit but SWP_SOLIDSTATE and 'if (si->cluster_info)' are two ways to
> check the same thing and I'd stick with the one that's already there.

Yes.  In effect, (si->flags & SWP_SOLIDSTATE) and (si->cluster_info)
always has same value at least for now.  But I don't think they are
exactly same in semantics.  So I would rather to use their exact
semantics.

>> @@ -2962,6 +2979,8 @@ static unsigned long read_swap_header(struct swap_info_struct *p,
>>  
>>  	p->lowest_bit  = 1;
>>  	p->cluster_next = 1;
>> +	for_each_possible_cpu(i)
>> +		per_cpu(*p->cluster_next_cpu, i) = 1;
>
> These are later overwritten if the device is an SSD which seems to be the only
> case where these are used, so why have this?

Yes.  You are right.  Will remove this in the future versions.

>>  	p->cluster_nr = 0;
>>  
>>  	maxpages = max_swapfile_size();
>> @@ -3204,6 +3223,10 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
>>  		 * SSD
>>  		 */
>>  		p->cluster_next = 1 + prandom_u32_max(p->highest_bit);
>> +		for_each_possible_cpu(cpu) {
>> +			per_cpu(*p->cluster_next_cpu, cpu) =
>> +				1 + prandom_u32_max(p->highest_bit);
>> +		}
>
> Is there a reason for adding one?  The history didn't enlighten me about why
> cluster_next does it.

The first swap slot is the swap partition header, you cand find the
corresponding code in syscall swapon function, below comments "Read the
swap header.".

Best Regards,
Huang, Ying