Message-ID: <20161012145826.wwxecoo4o3ypos5o@hz-desktop>
Date:   Wed, 12 Oct 2016 22:58:26 +0800
From:   Haozhong Zhang <haozhong.zhang@...el.com>
To:     Jan Beulich <JBeulich@...e.com>,
        "Dan Williams" <dan.j.williams@...el.com>,
        Konrad Rzeszutek Wilk <konrad.wilk@...cle.com>
Cc:     Stefano Stabellini <stefano@...reto.com>,
        Arnd Bergmann <arnd@...db.de>, <andrew.cooper3@...rix.com>,
        David Vrabel <david.vrabel@...rix.com>,
        Andrew Morton <akpm@...ux-foundation.org>,
        Xiao Guangrong <guangrong.xiao@...ux.intel.com>,
        Ross Zwisler <ross.zwisler@...ux.intel.com>,
        <xen-devel@...ts.xenproject.org>,
        "linux-nvdimm@...ts.01.org" <linux-nvdimm@...1.01.org>,
        Boris Ostrovsky <boris.ostrovsky@...cle.com>,
        Juergen Gross <JGross@...e.com>,
        Johannes Thumshirn <jthumshirn@...e.de>,
        "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>
Subject: Re: [Xen-devel] [RFC KERNEL PATCH 0/2] Add Dom0 NVDIMM support for
 Xen

On 10/12/16 05:32 -0600, Jan Beulich wrote:
>>>> On 12.10.16 at 12:33, <haozhong.zhang@...el.com> wrote:
>> The layout is shown as the following diagram.
>>
>> +---------------+-----------+-------+----------+--------------+
>> | whatever used | Partition | Super | Reserved | /dev/pmem0p1 |
>> |  by kernel    |   Table   | Block | for Xen  |              |
>> +---------------+-----------+-------+----------+--------------+
>>                 \_____________________ _______________________/
>>                                       V
>>                                   /dev/pmem0
>
>I have to admit that I dislike this, for not being OS-agnostic.
>Neither should there be any Xen-specific region, nor should the
>"whatever used by kernel" one be restricted to just Linux. What
>I could see is an OS-reserved area ahead of the partition table,
>the exact usage of which depends on which OS is currently
>running (and in the Xen case this might be both Xen _and_ the
>Dom0 kernel, arbitrated by a tbd protocol). After all, when
>running under Xen, the Dom0 may not have a need for as much
>control data as it has when running on bare hardware, for it
>controlling less (if any) of the actual memory ranges when Xen
>is present.
>

Isn't this OS-reserved area still not OS-agnostic, as it requires the
OS to know where the reserved area is?  Or do you mean it is OS-agnostic
if it's defined by a protocol that is accepted by all OSes?

Let me list two other methods that just came to mind.

1. The first method extends the usage of the super block used by
   the current Linux kernel to reserve space on pmem.

   The current Linux kernel places a super block of the following
   structure near the beginning of a pmem namespace.

    struct nd_pfn_sb {
            u8 signature[PFN_SIG_LEN];
            u8 uuid[16];
            u8 parent_uuid[16];
            __le32 flags;
            __le16 version_major;
            __le16 version_minor;
            __le64 dataoff; /* relative to namespace_base + start_pad */
            __le64 npfns;
            __le32 mode;
            /* minor-version-1 additions for section alignment */
            __le32 start_pad;
            __le32 end_trunc;
            /* minor-version-2 record the base alignment of the mapping */
            __le32 align;
            u8 padding[4000];
            __le64 checksum;
    };

    Two interesting fields here are 'dataoff' and 'mode':
    - 'dataoff' indicates the offset where the data area starts,
      i.e., IIUC, the part that can be accessed via /dev/pmemN or
      /dev/daxN.
    - 'mode' indicates whether Linux puts struct page for this
      namespace in RAM (= PFN_MODE_RAM) or on the device (=
      PFN_MODE_PMEM).

    Currently for Linux, only 'mode' is customizable, while 'dataoff'
    is not. If mode == PFN_MODE_RAM, no reservation for struct page is
    made on the device, and the data area starts almost immediately
    after the super block, except for a small reserved area in between
    for other structures and alignment. If mode == PFN_MODE_PMEM, the
    size of the reservation is decided by the kernel, i.e. 64 bytes per
    struct page.
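
    To give a feeling for the size of that reservation, here is a
    rough sketch of the arithmetic. It assumes 4 KiB pages and ignores
    alignment and the fact that the reserved area itself needs no
    struct page, so it is only an approximation. The sketch_* names
    are made up for illustration and are not kernel symbols.

```c
#include <stdint.h>

/* Assumptions for illustration only: 4 KiB pages, 64-byte struct page. */
#define SKETCH_PAGE_SIZE        4096ULL
#define SKETCH_STRUCT_PAGE_SIZE 64ULL

/*
 * Approximate number of bytes needed on the device to hold one
 * struct page for every page of a namespace of the given size.
 */
static uint64_t sketch_pmem_reservation(uint64_t namespace_bytes)
{
    uint64_t npfns = namespace_bytes / SKETCH_PAGE_SIZE;

    return npfns * SKETCH_STRUCT_PAGE_SIZE;
}
```

    Under these assumptions a 1 TiB namespace needs about 16 GiB for
    struct page, i.e. 1/64 of the namespace.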

    I propose to make the size of the reserved area customizable,
    e.g. via ioctl and ndctl.
    - If mode == PFN_MODE_PMEM and
      * if the given reserved size is large enough to hold what an OS
        (not limited to Linux) wants to put in, then the OS just
        starts using it as desired;
      * if the given reserved size is not enough, then the OS reports
        an error and may take other fallback actions.
    - If mode == PFN_MODE_RAM and
      * if the reserved size is zero, then it's the current way that
        Linux uses the device;
      * if the reserved size is non-zero, I would like to reserve this
        case for hypervisor (right now, namely the Xen hypervisor)
        usage. That is, the OS should not use the reserved area. For
        Xen, we could add a function to the Xen driver in the kernel
        to report the reserved area to the hypervisor.
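
    The four cases above could be sketched roughly as follows. This is
    only an illustration of the proposed rules, not existing kernel
    code: the enum names, their values and sketch_classify() are all
    hypothetical, and 'os_needed_bytes' stands for whatever the running
    OS wants to put in the reserved area.

```c
#include <stdint.h>

/* Local stand-ins for the kernel's mode values; the numbers are arbitrary. */
enum sketch_mode { SKETCH_MODE_RAM, SKETCH_MODE_PMEM };

/* Hypothetical outcomes of interpreting (mode, reserved size). */
enum sketch_action {
    SKETCH_OS_USES_RESERVED,    /* PMEM mode, area is large enough */
    SKETCH_OS_REPORTS_ERROR,    /* PMEM mode, area is too small */
    SKETCH_NO_RESERVATION,      /* RAM mode, size 0: current Linux usage */
    SKETCH_HYPERVISOR_RESERVED  /* RAM mode, size > 0: OS must not touch it */
};

static enum sketch_action sketch_classify(enum sketch_mode mode,
                                          uint64_t reserved_bytes,
                                          uint64_t os_needed_bytes)
{
    if (mode == SKETCH_MODE_PMEM)
        return reserved_bytes >= os_needed_bytes ? SKETCH_OS_USES_RESERVED
                                                 : SKETCH_OS_REPORTS_ERROR;

    /* RAM mode: a non-zero reservation is left to the hypervisor. */
    return reserved_bytes == 0 ? SKETCH_NO_RESERVATION
                               : SKETCH_HYPERVISOR_RESERVED;
}
```

    In the last case a Xen driver in the Dom0 kernel would be the one
    to report the reserved range to the hypervisor.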

   I guess this might be the OS-agnostic way Jan expects, but Dan may
   object to it.


2. Lay another pseudo device on the block device (e.g. /dev/pmemN)
   provided by the NVDIMM driver.

   This pseudo device can reserve the size according to user's
   requirement. The reservation information can be persistently
   recorded in a super block before the reserved area.

   This pseudo device also implements another pseudo block device to
   allow the non-reserved area to be accessed as a block device (we
   can even implement it as DAX-capable).

                                               pseudo block device
                                             /---------^-----------\
+------------------+-------+---------------+-----------------------+
|  whatever used   | Super |  reserved by  |                       |
| by NVDIMM driver | Block | pseudo device |                       |
+------------------+-------+---------------+-----------------------+
                     \_____________________ _______________________/
                                           V
                                       /dev/pmem0
                                (provided by NVDIMM driver)
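
   The offset translation such a pseudo block device would have to do
   for the layout in the diagram could look roughly like the sketch
   below. The struct, its fields and the function names are
   hypothetical, all sizes are in bytes, and overflow of the summed
   sizes is not checked.

```c
#include <stdint.h>

/* Hypothetical description of the layout inside /dev/pmemN. */
struct sketch_layout {
    uint64_t sb_offset;      /* start of the pseudo device's super block */
    uint64_t sb_size;        /* size of that super block */
    uint64_t reserved_size;  /* size of the reserved area after it */
    uint64_t device_size;    /* total size of /dev/pmemN */
};

/* First byte of /dev/pmemN that belongs to the pseudo block device. */
static uint64_t sketch_data_start(const struct sketch_layout *l)
{
    return l->sb_offset + l->sb_size + l->reserved_size;
}

/*
 * Translate an offset within the pseudo block device into an offset
 * within /dev/pmemN. Returns 0 on success, -1 if out of range.
 */
static int sketch_translate(const struct sketch_layout *l,
                            uint64_t pseudo_off, uint64_t *pmem_off)
{
    uint64_t start = sketch_data_start(l);

    if (start > l->device_size || pseudo_off >= l->device_size - start)
        return -1;

    *pmem_off = start + pseudo_off;
    return 0;
}
```

   Accesses to the pseudo block device never land in the super block
   or the reserved area, which is the property the hypervisor needs.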

   In order to make it work across different OSes, other OSes must
   recognize the same types of pmem block devices made by Linux and
   implement the driver for the pseudo device.

   This is inspired by Dan's reply at
   https://lists.xenproject.org/archives/html/xen-devel/2016-10/msg00651.html.

   However, it's essentially the same as my partition solution, so I
   guess Jan will still dislike it.


Any comments?

>The assumption of course is that the reserved area holds no
>persistent data. If that assumption didn't hold, you'd have to
>have per-OS reserved areas anyway (as many of them as
>there might be OSes [planned to get] installed on a particular
>system).
>

No persistent data should be placed in the reserved area.

Thanks,
Haozhong
