linux-kernel - Re: [PATCH gmem FIXUP] kvm: guestmem: do not use a file system

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite for Android: free password hash cracker in your pocket
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <ZSa5+upNcrclO35a@yilunxu-OptiPlex-7050>
Date:   Wed, 11 Oct 2023 23:06:34 +0800
From:   Xu Yilun <yilun.xu@...ux.intel.com>
To:     Sean Christopherson <seanjc@...gle.com>
Cc:     Al Viro <viro@...iv.linux.org.uk>, linux-kernel@...r.kernel.org,
        kvm@...r.kernel.org, Paolo Bonzini <pbonzini@...hat.com>
Subject: Re: [PATCH gmem FIXUP] kvm: guestmem: do not use a file system

On Tue, Oct 10, 2023 at 04:30:18PM -0700, Sean Christopherson wrote:
> On Tue, Oct 10, 2023, Al Viro wrote:
> > On Mon, Oct 09, 2023 at 05:27:04PM -0700, Sean Christopherson wrote:
> > 
> > > If the last reference is effectively held by guest_memfd, it would be:
> > > 
> > >   kvm_gmem_release(), a.k.a. file_operations.release()
> > >   |
> > >   -> kvm_put_kvm()
> > >      |
> > >      -> kvm_destroy_vm()
> > >         |
> > >         -> module_put(kvm_chardev_ops.owner);
> > 
> > ... and now your thread gets preempted and loses CPU; before you get
> > it back, some joker calls delete_module(), and page of code containing
> > kvm_gmem_release() is unmapped.  Even though an address within that
> > page is stored as return address in a frame on your thread's stack.
> > That thread gets the timeslice again and proceeds to return into
> > unmapped page.  Oops...
> 
> *sigh*
> 
> What an absolute snafu.  Sorry for the wall of text.  Feel free to stop reading
> after the "Back to KVM..." part below, it's just gory details on how we screwed
> things up in KVM.
> 
> But one question before I dive into the KVM mess: other than error paths, how is
> module_put(THIS_MODULE) ever safe without being superfluous?  I don't see how a
> module can put the last reference to itself without either hitting the above
> scenario, or without creating deadlock.  Something other than the module must put

I think module_get/put(THIS_MODULE) is not responsible for guarding
this scenario. It just makes a section against module removal. Out of
this section, you still need other mechanisms to ensure no task run on
the module code before its module_exit() return. This is still true even
if no module_get/put(THIS_MODULE) is used.

Today with all your help, I can see the fops.owner = THIS_MODULE is an
efficient way to ensure no entry to file callbacks when module_exit() is
called.

> the last reference, no?
> 
> The only exceptions I see are:
> 
>   1. if module_put() is called via async_run_entry_fn(), as delete_module() invokes
>      async_synchronize_full() before unmapping the module.  But IIUC, the async
>      framework uses workqueues, not the other way around.  I.e. delete_module()
>      doesn't flush arbitrary workqueues.
> 
>   2. if module_put() is called via module_put_and_kthread_exit(), which uses
>      THIS_MODULE but does module_put() from a core kernel helper and never returns
>      to the module's kthread, i.e. doesn't return to module code.
> 
> But then this
> 
>   $ git grep -E "module_put\(THIS_MODULE" | wc -l
>   132
> 
> make me go "huh"?  I've blamed a handful of those calls, and I can't find a single
> one that provides any clue as to why the module gets/puts references to itself,
> let alone actually justifies the usage.
> 
> E.g. drivers/block/loop.c has this gem
> 
> 	/* This is safe: open() is still holding a reference. */
> 	module_put(THIS_MODULE);
> 
> in __loop_clr_fd(), which is invoked from a .release() function.  So open() quite
> clearly doesn't hold a reference, unless the comment is talking about the reference
> that was obtained by the core file systems layer and won't be put until after
> .release() completes.  But then what on earth is the point of doing
> module_get(THIS_MODULE) and module_put(THIS_MODULE)?

I see the comment that .release()->__loop_clr_fd() is only called when in
"autoclear mode". Maybe in some "manual mode", user should explicitly
call IOCTL(LOOP_CLR_FD) to allow module removal. That means in some
case, user called IOCTL(LOOP_CONFIGURE) and then closed all fds, but the
device state was not cleaned up, so that the driver module is not allowed to
be removed.

Thanks,
Yilun

> 
> 
> Back to KVM...
> 
> Commit 5f6de5cbebee ("KVM: Prevent module exit until all VMs are freed") *tried*
> to fix a bug where KVM-the-module could be unloaded while a KVM workqueue callback
> was still in-flight.  The callback had a reference to the VM, but not to the VM's
> file representation.
> 
> After that commit went in, I suggested dropping the use of .owner for VMs and
> vCPUs (each of which is represented by an anon inode file) because keeping the VM
> alive would pin KVM-the-module until all VMs went away.  But I missed the
> obvious-in-hindsight issue Al highlighted above.
> 
> Fixing that particular wart is relatively easy: revert commit 70375c2d8fa3 ("Revert
> "KVM: set owner of cpu and vm file operations""), and give all of the other KVM-owned
> file_operations structures the same treatment by setting .owner correctly.  Note,
> "correctly" isn't THIS_MODULE in most cases, which is why the code existing is a
> bit odd.  For most file_operations, on x86 and PPC (and MIPS?), the effective owner
> is actually a sub-module, e.g. THIS_MODULE will point at kvm.ko, but on x86 the
> effective owner is either kvm-intel.ko or kvm-amd.ko (which holds a reference to
> kvm.ko).
> 
> After staring and fiddling for most of today, I finally discovered that grabbing
> a reference to the module on behalf of the work item didn't fix the actual bugs,
> plural, it just shuffled the deck chairs on the Titanic.  And as above, it set us
> up to make even bigger mistakes regarding .owner :-(
> 
> The problematic code is effectively kvm_clear_async_pf_completion_queue().  That
> helper is called for each vCPU when a VM is being destroyed, i.e. when the last
> reference to a VM is put via kvm_put_kvm().  Clearing the queue *should* also
> flush all work items, except it doesn't when the work is "done", where "done" just
> means the page being faulted in is ready.  Using file_operations.owner doesn't
> solve anything, e.g. even if async_pf_execute() were gifted a reference to the
> VM's file and used the deferred fput(), the same preemption issue exists, it's
> just slightly harder to hit.
> 
> The original async #PF code appears to have fudged around the lack of flushing by
> gifting a VM reference to the async_pf_execute().  Or maybe it was the other way
> around and not flushing was a workaround for the deadlock that occurs if
> kvm_clear_async_pf_completion_queue() does flush the workqueue.  If kvm_put_kvm()
> is called from async_pf_execute() and kvm_put_kvm() flushes the async #PF workqueue,
> deadlock occurs becase async_pf_execute() can't return until kvm_put_kvm() finishes,
> and kvm_put_kvm() can't return until async_pf_execute() finishes.
> 
>  WARNING: CPU: 8 PID: 251 at virt/kvm/kvm_main.c:1435 kvm_put_kvm+0x2d/0x320 [kvm]
>  Modules linked in: vhost_net vhost vhost_iotlb tap kvm_intel kvm irqbypass
>  CPU: 8 PID: 251 Comm: kworker/8:1 Tainted: G        W          6.6.0-rc1-e7af8d17224a-x86/gmem-vm #119
>  Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
>  Workqueue: events async_pf_execute [kvm]
>  RIP: 0010:kvm_put_kvm+0x2d/0x320 [kvm]
>  Call Trace:
>   <TASK>
>   async_pf_execute+0x198/0x260 [kvm]
>   process_one_work+0x145/0x2d0
>   worker_thread+0x27e/0x3a0
>   kthread+0xba/0xe0
>   ret_from_fork+0x2d/0x50
>   ret_from_fork_asm+0x11/0x20
>   </TASK>
>  ---[ end trace 0000000000000000 ]---
>  INFO: task kworker/8:1:251 blocked for more than 120 seconds.
>        Tainted: G        W          6.6.0-rc1-e7af8d17224a-x86/gmem-vm #119
>  "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>  task:kworker/8:1     state:D stack:0     pid:251   ppid:2      flags:0x00004000
>  Workqueue: events async_pf_execute [kvm]
>  Call Trace:
>   <TASK>
>   __schedule+0x33f/0xa40
>   schedule+0x53/0xc0
>   schedule_timeout+0x12a/0x140
>   __wait_for_common+0x8d/0x1d0
>   __flush_work.isra.0+0x19f/0x2c0
>   kvm_clear_async_pf_completion_queue+0x129/0x190 [kvm]
>   kvm_arch_destroy_vm+0x78/0x1b0 [kvm]
>   kvm_put_kvm+0x1c1/0x320 [kvm]
>   async_pf_execute+0x198/0x260 [kvm]
>   process_one_work+0x145/0x2d0
>   worker_thread+0x27e/0x3a0
>   kthread+0xba/0xe0
>   ret_from_fork+0x2d/0x50
>   ret_from_fork_asm+0x11/0x20
>   </TASK>
> 
> If kvm_clear_async_pf_completion_queue() actually flushes the workqueue, then
> there's no need to gift async_pf_execute() a reference because all invocations
> of async_pf_execute() will be forced to complete before the vCPU and its VM are
> destroyed/freed.  And that also fixes the module unloading mess because __fput()
> won't do module_put() on the last vCPU reference until the vCPU has been freed.
> 
> The attached patches are lightly tested, but I think they fix the KVM mess.  I
> likely won't post a proper series until next week, I'm going to be offline the
> next two days.

> From 017fedee5608094f2e5535297443db7512a213b8 Mon Sep 17 00:00:00 2001
> From: Sean Christopherson <seanjc@...gle.com>
> Date: Tue, 10 Oct 2023 11:42:32 -0700
> Subject: [PATCH 1/3] KVM: Set file_operations.owner appropriately for all such
>  structures
> 
> This reverts commit 70375c2d8fa3fb9b0b59207a9c5df1e2e1205c10, and gives
> 
> Signed-off-by: Sean Christopherson <seanjc@...gle.com>
> ---
>  arch/x86/kvm/debugfs.c |  1 +
>  virt/kvm/kvm_main.c    | 11 ++++++++---
>  2 files changed, 9 insertions(+), 3 deletions(-)
> 
> diff --git a/arch/x86/kvm/debugfs.c b/arch/x86/kvm/debugfs.c
> index ee8c4c3496ed..eea6ea7f14af 100644
> --- a/arch/x86/kvm/debugfs.c
> +++ b/arch/x86/kvm/debugfs.c
> @@ -182,6 +182,7 @@ static int kvm_mmu_rmaps_stat_release(struct inode *inode, struct file *file)
>  }
>  
>  static const struct file_operations mmu_rmaps_stat_fops = {
> +	.owner		= THIS_MODULE,
>  	.open		= kvm_mmu_rmaps_stat_open,
>  	.read		= seq_read,
>  	.llseek		= seq_lseek,
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 486800a7024b..1e65a506985f 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -3887,7 +3887,7 @@ static int kvm_vcpu_release(struct inode *inode, struct file *filp)
>  	return 0;
>  }
>  
> -static const struct file_operations kvm_vcpu_fops = {
> +static struct file_operations kvm_vcpu_fops = {
>  	.release        = kvm_vcpu_release,
>  	.unlocked_ioctl = kvm_vcpu_ioctl,
>  	.mmap           = kvm_vcpu_mmap,
> @@ -4081,6 +4081,7 @@ static int kvm_vcpu_stats_release(struct inode *inode, struct file *file)
>  }
>  
>  static const struct file_operations kvm_vcpu_stats_fops = {
> +	.owner = THIS_MODULE,
>  	.read = kvm_vcpu_stats_read,
>  	.release = kvm_vcpu_stats_release,
>  	.llseek = noop_llseek,
> @@ -4431,7 +4432,7 @@ static int kvm_device_release(struct inode *inode, struct file *filp)
>  	return 0;
>  }
>  
> -static const struct file_operations kvm_device_fops = {
> +static struct file_operations kvm_device_fops = {
>  	.unlocked_ioctl = kvm_device_ioctl,
>  	.release = kvm_device_release,
>  	KVM_COMPAT(kvm_device_ioctl),
> @@ -4759,6 +4760,7 @@ static int kvm_vm_stats_release(struct inode *inode, struct file *file)
>  }
>  
>  static const struct file_operations kvm_vm_stats_fops = {
> +	.owner = THIS_MODULE,
>  	.read = kvm_vm_stats_read,
>  	.release = kvm_vm_stats_release,
>  	.llseek = noop_llseek,
> @@ -5060,7 +5062,7 @@ static long kvm_vm_compat_ioctl(struct file *filp,
>  }
>  #endif
>  
> -static const struct file_operations kvm_vm_fops = {
> +static struct file_operations kvm_vm_fops = {
>  	.release        = kvm_vm_release,
>  	.unlocked_ioctl = kvm_vm_ioctl,
>  	.llseek		= noop_llseek,
> @@ -6095,6 +6097,9 @@ int kvm_init(unsigned vcpu_size, unsigned vcpu_align, struct module *module)
>  		goto err_async_pf;
>  
>  	kvm_chardev_ops.owner = module;
> +	kvm_vm_fops.owner = module;
> +	kvm_vcpu_fops.owner = module;
> +	kvm_device_fops.owner = module;
>  
>  	kvm_preempt_ops.sched_in = kvm_sched_in;
>  	kvm_preempt_ops.sched_out = kvm_sched_out;
> 
> base-commit: dfdc8b7884b50e3bfa635292973b530a97689f12
> -- 
> 2.42.0.609.gbb76f46606-goog
> 

> From f5be42f3be9967a0591051a7c8d73cac2c0a072b Mon Sep 17 00:00:00 2001
> From: Sean Christopherson <seanjc@...gle.com>
> Date: Tue, 10 Oct 2023 13:42:13 -0700
> Subject: [PATCH 2/3] KVM: Always flush async #PF workqueue when vCPU is being
>  destroyed
> 
> Signed-off-by: Sean Christopherson <seanjc@...gle.com>
> ---
>  virt/kvm/async_pf.c | 15 ++++++++++++---
>  1 file changed, 12 insertions(+), 3 deletions(-)
> 
> diff --git a/virt/kvm/async_pf.c b/virt/kvm/async_pf.c
> index e033c79d528e..7aeb9d1f43b1 100644
> --- a/virt/kvm/async_pf.c
> +++ b/virt/kvm/async_pf.c
> @@ -87,7 +87,6 @@ static void async_pf_execute(struct work_struct *work)
>  	__kvm_vcpu_wake_up(vcpu);
>  
>  	mmput(mm);
> -	kvm_put_kvm(vcpu->kvm);
>  }
>  
>  void kvm_clear_async_pf_completion_queue(struct kvm_vcpu *vcpu)
> @@ -114,7 +113,6 @@ void kvm_clear_async_pf_completion_queue(struct kvm_vcpu *vcpu)
>  #else
>  		if (cancel_work_sync(&work->work)) {
>  			mmput(work->mm);
> -			kvm_put_kvm(vcpu->kvm); /* == work->vcpu->kvm */
>  			kmem_cache_free(async_pf_cache, work);
>  		}
>  #endif
> @@ -126,7 +124,19 @@ void kvm_clear_async_pf_completion_queue(struct kvm_vcpu *vcpu)
>  			list_first_entry(&vcpu->async_pf.done,
>  					 typeof(*work), link);
>  		list_del(&work->link);
> +
> +		spin_unlock(&vcpu->async_pf.lock);
> +
> +		/*
> +		 * The async #PF is "done", but KVM must wait for the work item
> +		 * itself, i.e. async_pf_execute(), to run to completion.  If
> +		 * KVM is a module, KVM must ensure *no* code owned by the KVM
> +		 * (the module) can be run after the last call to module_put(),
> +		 * i.e. after the last reference to the last vCPU's file is put.
> +		 */
> +		flush_work(&work->work);
>  		kmem_cache_free(async_pf_cache, work);
> +		spin_lock(&vcpu->async_pf.lock);
>  	}
>  	spin_unlock(&vcpu->async_pf.lock);
>  
> @@ -186,7 +196,6 @@ bool kvm_setup_async_pf(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
>  	work->arch = *arch;
>  	work->mm = current->mm;
>  	mmget(work->mm);
> -	kvm_get_kvm(work->vcpu->kvm);
>  
>  	INIT_WORK(&work->work, async_pf_execute);
>  
> -- 
> 2.42.0.609.gbb76f46606-goog
> 

> From 0a4238f027e41c64afa2919440420ea56c0cae80 Mon Sep 17 00:00:00 2001
> From: Sean Christopherson <seanjc@...gle.com>
> Date: Tue, 10 Oct 2023 15:09:43 -0700
> Subject: [PATCH 3/3] Revert "KVM: Prevent module exit until all VMs are freed"
> 
> Revert KVM's misguided attempt to "fix" a use-after-module-unload bug that
> was actually due to failure to flush a workqueue, not a lack of module
> refcounting.
> 
> blah blah blah
> 
> This reverts commit 405294f29faee5de8c10cb9d4a90e229c2835279 and commit
> commit 5f6de5cbebee925a612856fce6f9182bb3eee0db.
> 
> Fixes: 405294f29fae ("KVM: Unconditionally get a ref to /dev/kvm module when creating a VM")
> Fixes: 5f6de5cbebee ("KVM: Prevent module exit until all VMs are freed")
> Signed-off-by: Sean Christopherson <seanjc@...gle.com>
> ---
>  virt/kvm/kvm_main.c | 7 -------
>  1 file changed, 7 deletions(-)
> 
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 1e65a506985f..3b1b9e8dd70c 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -115,8 +115,6 @@ EXPORT_SYMBOL_GPL(kvm_debugfs_dir);
>  
>  static const struct file_operations stat_fops_per_vm;
>  
> -static struct file_operations kvm_chardev_ops;
> -
>  static long kvm_vcpu_ioctl(struct file *file, unsigned int ioctl,
>  			   unsigned long arg);
>  #ifdef CONFIG_KVM_COMPAT
> @@ -1157,9 +1155,6 @@ static struct kvm *kvm_create_vm(unsigned long type, const char *fdname)
>  	if (!kvm)
>  		return ERR_PTR(-ENOMEM);
>  
> -	/* KVM is pinned via open("/dev/kvm"), the fd passed to this ioctl(). */
> -	__module_get(kvm_chardev_ops.owner);
> -
>  	KVM_MMU_LOCK_INIT(kvm);
>  	mmgrab(current->mm);
>  	kvm->mm = current->mm;
> @@ -1279,7 +1274,6 @@ static struct kvm *kvm_create_vm(unsigned long type, const char *fdname)
>  out_err_no_srcu:
>  	kvm_arch_free_vm(kvm);
>  	mmdrop(current->mm);
> -	module_put(kvm_chardev_ops.owner);
>  	return ERR_PTR(r);
>  }
>  
> @@ -1348,7 +1342,6 @@ static void kvm_destroy_vm(struct kvm *kvm)
>  	preempt_notifier_dec();
>  	hardware_disable_all();
>  	mmdrop(mm);
> -	module_put(kvm_chardev_ops.owner);
>  }
>  
>  void kvm_get_kvm(struct kvm *kvm)
> -- 
> 2.42.0.609.gbb76f46606-goog
>