linux-kernel - Re: [PATCH v6 3/5] vmcore: Introduce remap_oldmem_pfn

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <51E028AA.7070203@jp.fujitsu.com>
Date:	Sat, 13 Jul 2013 01:02:50 +0900
From:	HATAYAMA Daisuke <d.hatayama@...fujitsu.com>
To:	Michael Holzheu <holzheu@...ux.vnet.ibm.com>
CC:	kexec@...ts.infradead.org,
	Heiko Carstens <heiko.carstens@...ibm.com>,
	Jan Willeke <willeke@...ibm.com>, linux-kernel@...r.kernel.org,
	Martin Schwidefsky <schwidefsky@...ibm.com>,
	Vivek Goyal <vgoyal@...hat.com>
Subject: Re: [PATCH v6 3/5] vmcore: Introduce remap_oldmem_pfn_range()

(2013/07/10 20:00), Michael Holzheu wrote:
> On Wed, 10 Jul 2013 18:50:18 +0900
> HATAYAMA Daisuke <d.hatayama@...fujitsu.com> wrote:
>
> [snip]
>
>> (2013/07/10 17:42), Michael Holzheu wrote:
>>> My suggestion is to add the WARN_ONCE() for #ifndef CONFIG_S390. This has the same
>>> effect as your suggestion for all architectures besides of s390. And for s390 we
>>> take the risk that a programming error would result in poor /proc/vmcore
>>> performance.
>>>
>>
>> If you want to avoid looking up vmcore_list that takes linear time w.r.t. the number
>> of the elements, you can still calculate the range of offsets in /proc/vmcore
>> corresponding to HSA during /proc/vmcore initialization.
>>
>> Also, could you tell me how often and how much the HSA region is during crash dumping?
>> I guess the read to HSA is done mainly during early part of crash dumping process only.
>> According to the code, it appears at most 64MiB only. Then, I feel performance is not
>> a big issue.
>
> Currently it is 32 MiB and normally it is read only once.
>
>>
>> Also, cost of WARN_ONCE() is one memory access only in the 2nd and later calls. I don't
>> think it too much overhead...
>
> I was more concerned about in_valid_fault_range(). But I was most concerned the additional
> interface that introduces more complexity to the code. And that just to implement a
> sanity check that in our opinion we don't really need.
>
> And what makes it even worse:
>

What you think the sanity check is unnecessary is perfectly wrong. You design page faults
always happens on HSA region. If page fault happens on the other parts, i.e. some point
of mmap()ed region, it means somehow page table on the address has not been created. This
is bug, possibly caused by mmap() itself, page table creation, other components in kernel,
bit-flip due to broken hardware, etc. Anyway, program cannot detect what kind of bug occurs
now. There's no guarantee that program runs safely, of course for page cache creation, too.
We cannot and must expect such buggy process to behave in invalid states just as our design.
It results in undefined behaviour. The only thing we can do is to kill the buggy process
as soon as possible.

> With the current patch series this check is only relevant for s390 :-)
>
>>
>>> So, at least for this patch series I would implement the fault handler as follows:
>>>
>>> static int mmap_vmcore_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
>>> {
>>> ...
>>>           char *buf;
>>>           int rc;
>>>
>>> #ifndef CONFIG_S390
>>>           WARN_ONCE(1, "vmcore: Unexpected call of mmap_vmcore_fault()");
>>> #endif
>>>           page = find_or_create_page(mapping, index, GFP_KERNEL);
>>>
>>> At this point I have to tell you that we plan another vmcore patch series where
>>> the fault handler might be called also for other architectures. But I think we
>>> should *then* discuss your issue again.
>>>
>>
>> Could you explain the plan in more detail? Or I cannot review correctly since I don't
>> know whether there's really usecase of this generic fault handler for other
>> architectures.
>> This is the issue for architectures other than s390, not mine; now we
>> don't need it at all.
>
> I would have preferred to do the things one after the other. Otherwise I fear
> that this discussion will never come to an end. This patch series is needed
> to get zfcpdump running with /proc/vmcore. And for that reason the patch series
> makes sense for itself.
>
> But FYI:
>
> The other patch series deals with the problem that we have additional
> information for s390 that we want to include in /proc/vmcore. We have a one
> byte storage key per memory page. This storage keys are stored in the
> s390 firmware and can be read using a s390 specific machine instruction.
> We plan to put that information into the ELF notes section. For a 1 TiB
> dump we will have 256 MiB storage keys.
>
> Currently the notes section is preallocated in vmcore.c. Because we do
> not want to preallocate so much memory we would like to read the notes
> section on demand. Similar to the HSA memory for zfcpdump. To get this
> work with your mmap code, we would then also use the fault handler to
> get the notes sections on demand. Our current patch does this for all
> notes and therefore also the other architectures would then use the
> fault handler.
>
> One advantage for the common code is that no additional memory has
> to be allocated for the notes buffer.
>
> The storage key patch series is currently not final because it depends
> on the zfcpdump patch series.
>

Yes, on-demand memory allocation by fault handler is definitely suitable
for note sections throughout all architectures. But need of sanity check
is unchanged here. The issue is orthogonal. It's still needed just as
explained above.

BTW, sanity check for note sections is easy.

         if (!(elfnote->p_offset <= offset ||
               offset < elfnote->p_offset + elfnote->p_filesz))
             return VM_FAULT_SIGBUS;

I assume elfnote has a pointer to a unique PT_NOTE program header table
in the ELF header buffer. The code would be the same on every
architecture.

-- 
Thanks.
HATAYAMA, Daisuke

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/