linux-kernel - Re: [RFC PATCH v4] sched: Fix performance regression introduced by mm

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <b9f72ee9-328c-9de3-05e7-ea7d1339984b@oracle.com>
Date:   Mon, 10 Apr 2023 16:51:38 -0500
From:   michael.christie@...cle.com
To:     Mathieu Desnoyers <mathieu.desnoyers@...icios.com>,
        Peter Zijlstra <peterz@...radead.org>
Cc:     linux-kernel@...r.kernel.org, Aaron Lu <aaron.lu@...el.com>,
        Olivier Dion <odion@...icios.com>
Subject: Re: [RFC PATCH v4] sched: Fix performance regression introduced by
 mm_cid

On 4/10/23 10:01 AM, Mathieu Desnoyers wrote:
> Introduce per-mm/cpu current concurrency id (mm_cid) to fix a PostgreSQL
> sysbench regression reported by Aaron Lu.
> 
> Keep track of the currently allocated mm_cid for each mm/cpu rather than
> freeing them immediately on context switch. This eliminates most atomic
> operations when context switching back and forth between threads
> belonging to different memory spaces in multi-threaded scenarios (many
> processes, each with many threads). The per-mm/per-cpu mm_cid values are
> serialized by their respective runqueue locks.
> 
> Thread migration is handled by introducing invocation to
> sched_mm_cid_migrate_from() in set_task_cpu() and to
> sched_mm_cid_migrate_to() (with destination runqueue lock held) in
> activate_task() for migrating tasks. set_task_cpu() is invoked with and
> without source rq lock held: the wakeup path does not hold the source rq
> lock.
> 
> sched_mm_cid_migrate_from() clears the mm_cid from the task's mm per-cpu
> index corresponding to the source runqueue if it matches the last mm_cid
> observed by the migrated task. This last mm_cid value is returned as a
> hint to conditionally clear the mm's per-cpu mm_cid on the destination
> cpu.
> 
> Then, in sched_mm_cid_migrate_to(), if the last mm_cid is smaller than
> the mm's destination cpu current mm_cid, clear the mm's destination cpu
> current mm_cid. If the migrated task's mm is in use on the destination
> cpu, the reclaim of the mm_cid will be done lazily on the next
> destination cpu context switch, else it is performed immediately.
> 
> The source cpu's mm_cid is _not_ simply moved to the destination cpu on
> migration, because passing ownership of the mm_cid value to the
> destination cpu while an actively running tasks also has its own
> mm_cid value (in case of lazy reclaim on next context switch) would
> over-allocate mm_cid values beyond the number of possible cpus.
> 
> Because we want to ensure the mm_cid converges towards the smaller
> values as migrations happen, the prior optimization that was done when
> context switching between threads belonging to the same mm is removed,
> because it could delay the lazy release of the destination runqueue
> mm_cid after it has been replaced by a migration. Removing this prior
> optimization is not an issue performance-wise because the introduced
> per-mm/per-cpu mm_cid tracking also covers this more specific case.
> 
> This patch is based on v6.3-rc6 with this patch applied:
> 
> ("mm: Fix memory leak on mm_init error handling")
> 
> https://lore.kernel.org/lkml/20230330133822.66271-1-mathieu.desnoyers@efficios.com/
> 
> Fixes: af7f588d8f73 ("sched: Introduce per-memory-map concurrency ID")
> Link: https://lore.kernel.org/lkml/20230327080502.GA570847@ziqianlu-desk2/
> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@...icios.com>
> Cc: Aaron Lu <aaron.lu@...el.com>
> Cc: Peter Zijlstra <peterz@...radead.org>
> Cc: Olivier Dion <odion@...icios.com>
> Cc: michael.christie@...cle.com

Hey thanks for fixing this.

When testing linux-next with vhost devices, without this patch IOPs get stuck at around
1.3 million IOPs total when using 8 or more devices (you get a worker thread per device)
per VM. With this patch applied IOPs scale again, and we get up to 2.4M iops when using
up to 16 devices per VM.

Tested-by: Mike Christie <michael.christie@...cle.com>