linux-kernel - Re: [PATCH v2] mm: hwpoison: coredump: support recovery from dump_user

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <f345b2b4-73e5-a88d-6cff-767827ab57d0@huawei.com>
Date:   Thu, 27 Apr 2023 09:06:46 +0800
From:   Kefeng Wang <wangkefeng.wang@...wei.com>
To:     "Luck, Tony" <tony.luck@...el.com>,
        HORIGUCHI NAOYA(堀口 直也) 
        <naoya.horiguchi@....com>
CC:     "chu, jane" <jane.chu@...cle.com>,
        Thomas Gleixner <tglx@...utronix.de>,
        Alexander Viro <viro@...iv.linux.org.uk>,
        Christian Brauner <brauner@...nel.org>,
        "linux-fsdevel@...r.kernel.org" <linux-fsdevel@...r.kernel.org>,
        "linux-mm@...ck.org" <linux-mm@...ck.org>,
        Andrew Morton <akpm@...ux-foundation.org>,
        Miaohe Lin <linmiaohe@...wei.com>,
        "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
        Tong Tiangen <tongtiangen@...wei.com>,
        Jens Axboe <axboe@...nel.dk>
Subject: Re: [PATCH v2] mm: hwpoison: coredump: support recovery from
 dump_user_range()



On 2023/4/26 23:45, Luck, Tony wrote:
>>>> Thanks for your confirm, and what your option about add
>>>> MCE_IN_KERNEL_COPYIN to EX_TYPE_DEFAULT_MCE_SAFE/FAULT_MCE_SAFE type
>>>> to let do_machine_check call queue_task_work(&m, msg, kill_me_never),
>>>> which kill every call memory_failure_queue() after mc safe copy return?
>>>
>>> I haven't been following this thread closely. Can you give a link to the e-mail
>>> where you posted a patch that does this? Or just repost that patch if easier.
>>
>> The major diff changes is [1], I will post a formal patch when -rc1 out,
>> thanks.
>>
>> [1]
>> https://lore.kernel.org/linux-mm/6dc1b117-020e-be9e-7e5e-a349ffb7d00a@huawei.com/
> 
> There seem to be a few misconceptions in that message. Not sure if all of them
> were resolved.  Here are some pertinent points:
> 
>>>> In my understanding, an MCE should not be triggered when MC-safe copy
>>>> tries
>>>> to access to a memory error.  So I feel that we might be talking about
>>>> different scenarios.
> 
> This is wrong. There is still a machine check when a MC-safe copy does a read
> from a location that has a memory error.
> 
> The recovery flow in this case does not involve queue_task_work(). That is only
> useful for machine check exceptions taken in user context. The queued work will
> be executed to call memory_failure() from the kernel, but in process context (not
> from the machine check exception stack) to handle the error.
> 
> For machine checks taken by kernel code (MC-safe copy functions) the recovery
> path is here:
> 
>                  if (m.kflags & MCE_IN_KERNEL_RECOV) {
>                          if (!fixup_exception(regs, X86_TRAP_MC, 0, 0))
>                                  mce_panic("Failed kernel mode recovery", &m, msg);
>                  }
> 
>                  if (m.kflags & MCE_IN_KERNEL_COPYIN)
>                          queue_task_work(&m, msg, kill_me_never);
> 
> The "fixup_exception()" ensures that on return from the machine check handler
> code returns to the extable[] fixup location instead of the instruction that was
> loading from the memory error location.
> 
> When the exception was from one of the copy_from_user() variants it makes
> sense to also do the queue_task_work() because the kernel is going to return
> to the user context (with an EFAULT error code from whatever system call was
> attempting the copy_from_user()).
> 
> But in the core dump case there is no return to user. The process is being
> terminated by the signal that leads to this core dump. So even though you
> may consider the page being accessed to be a "user" page, you can't fix
> it by queueing work to run on return to user.

For coredump，the task work will be called too, see following code,

get_signal
	sig_kernel_coredump
		elf_core_dump
			dump_user_range
				_copy_from_iter // with MC-safe copy, return without panic
	do_group_exit(ksig->info.si_signo);
		do_exit
			exit_task_work
				task_work_run
					kill_me_never
						memory_failure

I also add debug print to check the memory_failure() processing after
add MCE_IN_KERNEL_COPYIN to MCE_SAFE exception type, also tested CoW of
normal page and huge page, it works too.

> 
> I don't have an well thought out suggestion on how to make sure that memory_failure()
> is called for the page in this case. Maybe the core dump code can check for the
> return from the MC-safe copy it is using and handle it in the error path?
> 

So we could safely add MCE_IN_KERNEL_COPYIN to all MCE_SAFE exception type?

The kernel diff based on next-20230425 shown bellow

diff --git a/arch/x86/kernel/cpu/mce/severity.c 
b/arch/x86/kernel/cpu/mce/severity.c
index c4477162c07d..63e94484c5d6 100644
--- a/arch/x86/kernel/cpu/mce/severity.c
+++ b/arch/x86/kernel/cpu/mce/severity.c
@@ -293,12 +293,11 @@ static noinstr int error_context(struct mce *m, 
struct pt_regs *regs)
  	case EX_TYPE_COPY:
  		if (!copy_user)
  			return IN_KERNEL;
-		m->kflags |= MCE_IN_KERNEL_COPYIN;
  		fallthrough;

  	case EX_TYPE_FAULT_MCE_SAFE:
  	case EX_TYPE_DEFAULT_MCE_SAFE:
-		m->kflags |= MCE_IN_KERNEL_RECOV;
+		m->kflags |= MCE_IN_KERNEL_RECOV | MCE_IN_KERNEL_COPYIN;
  		return IN_KERNEL_RECOV;

  	default:
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 4509a566fe6c..59a7afe3dfce 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -3667,23 +3667,19 @@ enum mf_flags {
  int mf_dax_kill_procs(struct address_space *mapping, pgoff_t index,
  		      unsigned long count, int mf_flags);
  extern int memory_failure(unsigned long pfn, int flags);
+extern void memory_failure_queue(unsigned long pfn, int flags);
  extern void memory_failure_queue_kick(int cpu);
  extern int unpoison_memory(unsigned long pfn);
  extern void shake_page(struct page *p);
  extern atomic_long_t num_poisoned_pages __read_mostly;
  extern int soft_offline_page(unsigned long pfn, int flags);
  #ifdef CONFIG_MEMORY_FAILURE
-extern void memory_failure_queue(unsigned long pfn, int flags);
  extern int __get_huge_page_for_hwpoison(unsigned long pfn, int flags,
  					bool *migratable_cleared);
  void num_poisoned_pages_inc(unsigned long pfn);
  void num_poisoned_pages_sub(unsigned long pfn, long i);
  struct task_struct *task_early_kill(struct task_struct *tsk, int 
force_early);
  #else
-static inline void memory_failure_queue(unsigned long pfn, int flags)
-{
-}
-
  static inline int __get_huge_page_for_hwpoison(unsigned long pfn, int 
flags,
  					bool *migratable_cleared)
  {
diff --git a/mm/ksm.c b/mm/ksm.c
index 0156bded3a66..7abdf4892387 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -2794,7 +2794,6 @@ struct page *ksm_might_need_to_copy(struct page *page,
  	if (new_page) {
  		if (copy_mc_user_highpage(new_page, page, address, vma)) {
  			put_page(new_page);
-			memory_failure_queue(page_to_pfn(page), 0);
  			return ERR_PTR(-EHWPOISON);
  		}
  		SetPageDirty(new_page);
diff --git a/mm/memory.c b/mm/memory.c
index 5e2c6b1fc00e..c0f586257017 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2814,10 +2814,8 @@ static inline int __wp_page_copy_user(struct page 
*dst, struct page *src,
  	unsigned long addr = vmf->address;

  	if (likely(src)) {
-		if (copy_mc_user_highpage(dst, src, addr, vma)) {
-			memory_failure_queue(page_to_pfn(src), 0);
+		if (copy_mc_user_highpage(dst, src, addr, vma))
  			return -EHWPOISON;
-		}
  		return 0;
  	}

@@ -5852,10 +5850,8 @@ static int copy_user_gigantic_page(struct folio 
*dst, struct folio *src,

  		cond_resched();
  		if (copy_mc_user_highpage(dst_page, src_page,
-					  addr + i*PAGE_SIZE, vma)) {
-			memory_failure_queue(page_to_pfn(src_page), 0);
+					  addr + i*PAGE_SIZE, vma))
  			return -EHWPOISON;
-		}
  	}
  	return 0;
  }
@@ -5871,10 +5867,8 @@ static int copy_subpage(unsigned long addr, int 
idx, void *arg)
  	struct copy_subpage_arg *copy_arg = arg;

  	if (copy_mc_user_highpage(copy_arg->dst + idx, copy_arg->src + idx,
-				  addr, copy_arg->vma)) {
-		memory_failure_queue(page_to_pfn(copy_arg->src + idx), 0);
+				  addr, copy_arg->vma))
  		return -EHWPOISON;
-	}
  	return 0;
  }

-- 
2.35.3


Thanks,

> -Tony