Date:   Sat, 19 Jan 2019 16:53:48 +0800
From:   liaoweixiong <liaoweixiong@...winnertech.com>
To:     Kees Cook <keescook@...omium.org>
Cc:     Anton Vorontsov <anton@...msg.org>,
        Colin Cross <ccross@...roid.com>,
        Tony Luck <tony.luck@...el.com>,
        Jonathan Corbet <corbet@....net>,
        Mauro Carvalho Chehab <mchehab+samsung@...nel.org>,
        Greg Kroah-Hartman <gregkh@...uxfoundation.org>,
        "David S. Miller" <davem@...emloft.net>,
        Andrew Morton <akpm@...ux-foundation.org>,
        Nicolas Ferre <nicolas.ferre@...rochip.com>,
        Arnd Bergmann <arnd@...db.de>,
        "open list:DOCUMENTATION" <linux-doc@...r.kernel.org>,
        LKML <linux-kernel@...r.kernel.org>
Subject: Re: [RFC v5 1/4] pstore/blk: new support logger for block devices

On 2019-01-18 08:12, Kees Cook wrote:
> On Mon, Jan 7, 2019 at 4:01 AM liaoweixiong
> <liaoweixiong@...winnertech.com> wrote:
>>
>> pstore_blk is similar to pstore_ram, but dump log to block devices
>> rather than persistent ram.
>>
>> Why should we need pstore_blk?
>> 1. Most embedded intelligent equipment have no persistent ram, which
>> increases costs. We perfer to cheaper solutions, like block devices.
>> In fast, there is already a sample for block device logger in driver
> 
> typo: fast -> fact

Fixed. Thank you for your careful correction.

> 
>> MTD (drivers/mtd/mtdoops.c).
> 
> Would mtdoops get dropped in favor of pstore/blk, or do they not share features?
> 

We can show them what pstore/blk does; I think they will be interested in
it. They would only need to do a little work, such as implementing a
function for panic read, and in return they would gain enhanced features,
such as presenting the logs as files and recording multiple logs.
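
For instance, hooking an mtdoops-like driver up to pstore/blk could look
roughly like the sketch below. This is purely illustrative: the partition
path, the sizes and mtd_panic_read() are hypothetical; only struct
blkz_info and blkz_register() come from this patch.

#include <linux/module.h>
#include <linux/pstore_blk.h>

/* hypothetical: read the partition without going through the io stack,
 * so it stays usable during panic; a real driver fills this in */
static ssize_t mtd_panic_read(char *buf, size_t bytes, loff_t pos)
{
	return -EOPNOTSUPP;	/* stub for this sketch */
}

static struct blkz_info mtd_blkz_info = {
	.owner		= THIS_MODULE,
	.name		= "mtdoops-blk",
	.part_path	= "/dev/mtdblock2",	/* hypothetical partition */
	.part_size	= 1024 * 1024,		/* multiple of 4096 */
	.dmesg_size	= 64 * 1024,		/* one zone per record */
	.dump_oops	= 1,
	/* .read/.write left NULL: the default handlers are used */
	.panic_read	= mtd_panic_read,
};

static int __init mtd_blkoops_init(void)
{
	/* register once the backing device is ready */
	return blkz_register(&mtd_blkz_info);
}
module_init(mtd_blkoops_init);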

>> 2. Do not any equipment have battery, which means that it lost all data
>> on general ram if power failure. Pstore has little to do for these
>> equipments.
>>
>> pstore_blk can only dump Oops/Panic log to block devices. It only
>> supports dmesg now. To make pstore_blk work, the block driver should
>> provide the path of a block partition device (/dev/XXXX) and the
>> read/write apis when on panic.
>>
>> pstore_blk begins at 'blkz_register', by witch block device can register
>> a block partition to pstore_blk. Then pstore_blk divide and manage the
>> partition as zones, which is similar to pstore_ram.
>>
>> Recommend that, block driver register pstore_blk after block device is
>> ready.
>>
>> pstore_blk works well on allwinner(sunxi) platform.
>>
>> Signed-off-by: liaoweixiong <liaoweixiong@...winnertech.com>
>> ---
>>  fs/pstore/Kconfig          |   7 +
>>  fs/pstore/Makefile         |   3 +
>>  fs/pstore/blkzone.c        | 958 +++++++++++++++++++++++++++++++++++++++++++++
>>  include/linux/pstore_blk.h |  61 +++
>>  4 files changed, 1029 insertions(+)
>>  create mode 100644 fs/pstore/blkzone.c
>>  create mode 100644 include/linux/pstore_blk.h
>>
>> diff --git a/fs/pstore/Kconfig b/fs/pstore/Kconfig
>> index 8b3ba27..a881c53 100644
>> --- a/fs/pstore/Kconfig
>> +++ b/fs/pstore/Kconfig
>> @@ -152,3 +152,10 @@ config PSTORE_RAM
>>           "ramoops.ko".
>>
>>           For more information, see Documentation/admin-guide/ramoops.rst.
>> +
>> +config PSTORE_BLK
>> +       tristate "Log panic/oops to a block device"
>> +       depends on PSTORE
>> +       help
>> +         This enables panic and oops message to be logged to a block dev
>> +         where it can be read back at some later point.
>> diff --git a/fs/pstore/Makefile b/fs/pstore/Makefile
>> index 967b589..c431700 100644
>> --- a/fs/pstore/Makefile
>> +++ b/fs/pstore/Makefile
>> @@ -12,3 +12,6 @@ pstore-$(CONFIG_PSTORE_PMSG)  += pmsg.o
>>
>>  ramoops-objs += ram.o ram_core.o
>>  obj-$(CONFIG_PSTORE_RAM)       += ramoops.o
>> +
>> +obj-$(CONFIG_PSTORE_BLK)       += pstore_blk.o
>> +pstore_blk-y += blkzone.o
>> diff --git a/fs/pstore/blkzone.c b/fs/pstore/blkzone.c
>> new file mode 100644
>> index 0000000..e1b7f26
>> --- /dev/null
>> +++ b/fs/pstore/blkzone.c
>> @@ -0,0 +1,958 @@
>> +// SPDX-License-Identifier: GPL-2.0
>> +/*
>> + *
>> + * blkzone.c: Block device Oops/Panic logger
>> + *
>> + * Copyright (C) 2019 liaoweixiong <liaoweixiong@...lwinnertech.com>
>> + *
>> + * This program is free software; you can redistribute it and/or modify
>> + * it under the terms of the GNU General Public License version 2 as
>> + * published by the Free Software Foundation.
>> + *
>> + * This program is distributed in the hope that it will be useful,
>> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
>> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
>> + * GNU General Public License for more details.
>> + *
>> + */
>> +
>> +#define pr_fmt(fmt) "blkzone: " fmt
> 
> To follow with "ramoops", maybe this should be "blkoops"? I don't have
> a strong opinion, but "zone" is only used internally in ram.c for the
> memory segments (zones), so it's not clear to me if "blkzone" is
> meaningful to a given pstore end user.
> 

You are right, "zone" could cause misunderstanding. How about "pstore_blk"?
I will change it as follows:

#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt

>> +
>> +#include <linux/kernel.h>
>> +#include <linux/module.h>
>> +#include <linux/slab.h>
>> +#include <linux/blkdev.h>
>> +#include <linux/pstore.h>
>> +#include <linux/mount.h>
>> +#include <linux/fs.h>
>> +#include <linux/pstore_blk.h>
>> +
>> +/**
>> + * struct blkz_head - head of zone to flush to storage
>> + *
>> + * @sig: signature to indicate header (BLK_SIG xor BLKZONE-type value)
>> + * @datalen: length of data in @data
>> + * @data: zone data.
>> + */
>> +struct blkz_buffer {
>> +#define BLK_SIG (0x43474244) /* DBGC */
>> +       uint32_t sig;
>> +       atomic_t datalen;
>> +       uint8_t data[0];
> 
> nit: I prefer this to be [] instead of [0], but technically is doesn't
> matter. :)
> 

Either form is fine with me; I will change it to [] since you prefer that.

>> +};
>> +
>> +/**
>> + * sruct blkz_dmesg_header: dmesg information
> 
> typo: sruct -> struct
> 

Fixed.

>> + *
>> + * @magic: magic num for dmesg header
>> + * @time: trigger time
>> + * @compressed: whether conpressed
>> + * @count: oops/panic counter
>> + * @reason: identify oops or panic
>> + */
>> +struct blkz_dmesg_header {
>> +#define DMESG_HEADER_MAGIC 0x4dfc3ae5
>> +       uint32_t magic;
>> +       struct timespec64 time;
>> +       bool compressed;
>> +       uint32_t counter;
>> +       enum kmsg_dump_reason reason;
>> +       uint8_t data[0];
> 
> Same.
> 

What's wrong here?

>> +};
>> +
>> +/**
>> + * struct blkz_zone - zone information
>> + * @off:
>> + *     zone offset of partition
>> + * @type:
>> + *     frontent type for this zone
>> + * @name:
>> + *     frontent name for this zone
>> + * @buffer:
>> + *     pointer to data buffer managed by this zone
>> + * @buffer_size:
>> + *     bytes in @buffer->data
>> + * @should_recover:
>> + *     should recover from storage
>> + * @dirty:
>> + *     mark whether the data in @buffer are dirty (not flush to storage yet)
>> + */
>> +struct blkz_zone {
>> +       unsigned long off;
>> +       const char *name;
>> +       enum pstore_type_id type;
>> +
>> +       struct blkz_buffer *buffer;
>> +       size_t buffer_size;
>> +       bool should_recover;
>> +       atomic_t dirty;
>> +};
>> +
>> +struct blkoops_context {
>> +       struct blkz_zone **dbzs;        /* dmesg block zones */
>> +       unsigned int dmesg_max_cnt;
>> +       unsigned int dmesg_read_cnt;
>> +       unsigned int dmesg_write_cnt;
>> +       /**
>> +        * the counter should be recovered when do recovery
>> +        * It records the oops/panic times after burning rather than booting.
>> +        */
>> +       unsigned int oops_counter;
>> +       unsigned int panic_counter;
>> +       atomic_t blkdev_up;
>> +       atomic_t recovery;
>> +       atomic_t on_panic;
>> +
>> +       /*
>> +        * bzinfo_lock just protects "bzinfo" during calls to
>> +        * blkz_register/blkz_unregister
>> +        */
>> +       spinlock_t bzinfo_lock;
>> +       struct blkz_info *bzinfo;
>> +       struct pstore_info pstore;
>> +};
>> +static struct blkoops_context blkz_cxt;
>> +
>> +enum blkz_flush_mode {
>> +       FLUSH_NONE = 0,
>> +       FLUSH_PART,
>> +       FLUSH_META,
>> +       FLUSH_ALL,
>> +};
>> +
>> +static inline int buffer_datalen(struct blkz_zone *zone)
>> +{
>> +       return atomic_read(&zone->buffer->datalen);
>> +}
>> +
>> +static inline bool is_on_panic(void)
>> +{
>> +       struct blkoops_context *cxt = &blkz_cxt;
>> +
>> +       return atomic_read(&cxt->on_panic);
>> +}
>> +
>> +static inline bool is_blkdev_up(void)
>> +{
>> +       struct blkoops_context *cxt = &blkz_cxt;
>> +       const char *devpath = cxt->bzinfo->part_path;
>> +
>> +       if (atomic_read(&cxt->blkdev_up))
>> +               return true;
>> +       if (is_on_panic())
>> +               goto set_up;
>> +       if (devpath && !name_to_dev_t(devpath))
>> +               return false;
>> +set_up:
>> +       atomic_set(&cxt->blkdev_up, 1);
>> +       return true;
>> +}
>> +
>> +static int blkz_zone_read(struct blkz_zone *zone, char *buf,
>> +               size_t len, unsigned long off)
>> +{
>> +       if (!buf || !zone->buffer)
>> +               return -EINVAL;
>> +       len = min((size_t)len, (size_t)(zone->buffer_size - off));
> 
> I'd rather this had an explicit overflow test before the min(): make
> sure "off" is not > buffer_size. (And probably you want min_t(size_t,
> ...) instead of the double-cast.
> 

Fixed

>> +       memcpy(buf, zone->buffer->data + off, len);
>> +       return 0;
>> +}
>> +
>> +static int blkz_zone_write(struct blkz_zone *zone,
>> +               enum blkz_flush_mode flush_mode, const char *buf,
>> +               size_t len, unsigned long off)
>> +{
>> +       struct blkz_info *info = blkz_cxt.bzinfo;
>> +       ssize_t wcnt;
>> +       ssize_t (*writeop)(const char *buf, size_t bytes, loff_t pos);
>> +       size_t wlen;
>> +
>> +       wlen = min((size_t)len, (size_t)(zone->buffer_size - off));
> 
> Same overflow test here.
> 

Fixed

>> +       if (flush_mode != FLUSH_META && flush_mode != FLUSH_NONE) {
>> +               if (buf && zone->buffer)
>> +                       memcpy(zone->buffer->data + off, buf, wlen);
>> +               atomic_set(&zone->buffer->datalen, wlen + off);
>> +       }
>> +
>> +       if (!is_blkdev_up())
>> +               goto set_dirty;
>> +
>> +       writeop = is_on_panic() ? info->panic_write : info->write;
>> +       if (!writeop)
>> +               return -EINVAL;
>> +
>> +       switch (flush_mode) {
>> +       case FLUSH_NONE:
>> +               return 0;
>> +       case FLUSH_PART:
>> +               wcnt = writeop((const char *)zone->buffer->data + off, wlen,
>> +                               zone->off + sizeof(*zone->buffer) + off);
>> +               if (wcnt != wlen)
>> +                       goto set_dirty;
>> +       case FLUSH_META:
>> +               wlen = sizeof(struct blkz_buffer);
>> +               wcnt = writeop((const char *)zone->buffer, wlen, zone->off);
>> +               if (wcnt != wlen)
>> +                       goto set_dirty;
>> +               break;
>> +       case FLUSH_ALL:
>> +               wlen = buffer_datalen(zone) + sizeof(*zone->buffer);
>> +               wcnt = writeop((const char *)zone->buffer, wlen, zone->off);
>> +               if (wcnt != wlen)
>> +                       goto set_dirty;
>> +               break;
>> +       }
>> +
>> +       return 0;
>> +set_dirty:
>> +       atomic_set(&zone->dirty, true);
>> +       return -EBUSY;
>> +}
>> +
>> +/*
>> + * blkz_move_zone: move data from a old zone to a new zone
>> + *
>> + * @old: the old zone
>> + * @new: the new zone
>> + *
>> + * NOTE:
>> + *     Call blkz_zone_write to copy and flush data. If it failed, we
>> + *     should reset new->dirty, because the new zone not realy dirty.
> 
> typo: realy -> really
> 

Fixed.

>> + */
>> +static int blkz_move_zone(struct blkz_zone *old, struct blkz_zone *new)
>> +{
>> +       const char *data = (const char *)old->buffer->data;
>> +       int ret;
>> +
>> +       ret = blkz_zone_write(new, FLUSH_ALL, data, buffer_datalen(old), 0);
>> +       if (ret) {
>> +               atomic_set(&new->buffer->datalen, 0);
>> +               atomic_set(&new->dirty, false);
>> +               return ret;
>> +       }
>> +       atomic_set(&old->buffer->datalen, 0);
>> +       return 0;
>> +}
>> +
>> +static int blkz_recover_dmesg_data(struct blkoops_context *cxt)
>> +{
>> +       struct blkz_info *info = cxt->bzinfo;
>> +       struct blkz_zone *zone = NULL;
>> +       struct blkz_buffer *buf;
>> +       unsigned long i;
>> +       ssize_t (*readop)(char *buf, size_t bytes, loff_t pos);
>> +       ssize_t rcnt;
>> +
>> +       readop = is_on_panic() ? info->panic_read : info->read;
>> +       if (!readop)
>> +               return -EINVAL;
>> +
>> +       for (i = 0; i < cxt->dmesg_max_cnt; i++) {
>> +               zone = cxt->dbzs[i];
>> +               if (unlikely(!zone))
>> +                       return -EINVAL;
>> +               if (atomic_read(&zone->dirty)) {
>> +                       unsigned int wcnt = cxt->dmesg_write_cnt;
>> +                       struct blkz_zone *new = cxt->dbzs[wcnt];
>> +                       int ret;
>> +
>> +                       ret = blkz_move_zone(zone, new);
>> +                       if (ret) {
>> +                               pr_err("move zone from %lu to %d failed\n",
>> +                                               i, wcnt);
>> +                               return ret;
>> +                       }
>> +                       cxt->dmesg_write_cnt = (wcnt + 1) % cxt->dmesg_max_cnt;
>> +               }
>> +               if (!zone->should_recover)
>> +                       continue;
>> +               buf = zone->buffer;
>> +               rcnt = readop((char *)buf, zone->buffer_size + sizeof(*buf),
>> +                               zone->off);
>> +               if (rcnt != zone->buffer_size + sizeof(*buf))
>> +                       return (int)rcnt < 0 ? (int)rcnt : -EIO;
>> +       }
>> +       return 0;
>> +}
>> +
>> +/*
> 
> typo: for kern-doc the opening comment should be /**
> 

Fixed.

>> + * blkz_recover_dmesg_meta: recover meta datas of dmesg
> 
> typo: meta datas -> metadata ("data" is already plural)
> 

Fixed.

>> + *
>> + * Recover datas as follow:
> 
> Should this be "metadata" ?
> 

Sure.

>> + * @cxt->dmesg_write_cnt
>> + * @cxt->oops_counter
>> + * @cxt->panic_counter
>> + */
>> +static int blkz_recover_dmesg_meta(struct blkoops_context *cxt)
>> +{
>> +       struct blkz_info *info = cxt->bzinfo;
>> +       struct blkz_zone *zone;
>> +       size_t rcnt, len;
>> +       struct blkz_buffer *buf;
>> +       struct blkz_dmesg_header *hdr;
>> +       ssize_t (*readop)(char *buf, size_t bytes, loff_t pos);
>> +       struct timespec64 time = {0};
>> +       unsigned long i;
>> +       /**
> 
> typo: double-* not needed here
> 

Fixed.

>> +        * Recover may on panic, we can't allocate any memory by kmalloc.
>> +        * So, we use local array instead.
>> +        */
>> +       char buffer_header[sizeof(*buf) + sizeof(*hdr)] = {0};
>> +
>> +       readop = is_on_panic() ? info->panic_read : info->read;
>> +       if (!readop)
>> +               return -EINVAL;
>> +
>> +       len = sizeof(*buf) + sizeof(*hdr);
>> +       buf = (struct blkz_buffer *)buffer_header;
>> +       for (i = 0; i < cxt->dmesg_max_cnt; i++) {
>> +               zone = cxt->dbzs[i];
>> +               if (unlikely(!zone))
>> +                       return -EINVAL;
>> +
>> +               rcnt = readop((char *)buf, len, zone->off);
>> +               if (rcnt != len)
>> +                       return (int)rcnt < 0 ? (int)rcnt : -EIO;
>> +
>> +               /*
>> +                * If sig NOT match, it means this zone never used before,
>> +                * because we write one by one, and we never modify sig even
>> +                * when erase. So, we do not need to check next one.
>> +                */
>> +               if (buf->sig != zone->buffer->sig) {
>> +                       cxt->dmesg_write_cnt = i;
>> +                       pr_debug("no valid data in dmesg zone %lu\n", i);
>> +                       break;
>> +               }
>> +
>> +               if (zone->buffer_size < atomic_read(&buf->datalen)) {
>> +                       pr_info("found overtop zone: %s: id %lu, off %lu, size %zu\n",
>> +                                       zone->name, i, zone->off,
>> +                                       zone->buffer_size);
>> +                       continue;
>> +               }
>> +
>> +               hdr = (struct blkz_dmesg_header *)buf->data;
>> +               if (hdr->magic != DMESG_HEADER_MAGIC) {
>> +                       pr_info("found invalid zone: %s: id %lu, off %lu, size %zu\n",
>> +                                       zone->name, i, zone->off,
>> +                                       zone->buffer_size);
>> +                       continue;
>> +               }
>> +
>> +               /*
>> +                * we get the newest zone, and the next one must be the oldest
>> +                * or unused zone, because we do write one by one like a circle.
>> +                */
>> +               if (hdr->time.tv_sec >= time.tv_sec) {
>> +                       time.tv_sec = hdr->time.tv_sec;
>> +                       cxt->dmesg_write_cnt = (i + 1) % cxt->dmesg_max_cnt;
>> +               }
>> +
>> +               if (hdr->reason == KMSG_DUMP_OOPS)
>> +                       cxt->oops_counter =
>> +                               max(cxt->oops_counter, hdr->counter);
>> +               else
>> +                       cxt->panic_counter =
>> +                               max(cxt->panic_counter, hdr->counter);
>> +
>> +               if (!atomic_read(&buf->datalen)) {
>> +                       pr_debug("found erased zone: %s: id %ld, off %lu, size %zu, datalen %d\n",
>> +                                       zone->name, i, zone->off,
>> +                                       zone->buffer_size,
>> +                                       atomic_read(&buf->datalen));
>> +                       continue;
>> +               }
>> +
>> +               zone->should_recover = true;

It is useless to recover (read old data from the block device) when on
panic. So, I will modify it as follows:

                if (!is_on_panic())
                        zone->should_recover = true;

>> +               pr_debug("found nice zone: %s: id %ld, off %lu, size %zu, datalen %d\n",
>> +                               zone->name, i, zone->off,
>> +                               zone->buffer_size, atomic_read(&buf->datalen));
>> +       }
>> +
>> +       return 0;
>> +}
>> +
>> +static int blkz_recover_dmesg(struct blkoops_context *cxt)
>> +{
>> +       int ret;
>> +
>> +       if (!cxt->dbzs)
>> +               return 0;
>> +
>> +       ret = blkz_recover_dmesg_meta(cxt);
>> +       if (ret)
>> +               goto recover_fail;
>> +
>> +       ret = blkz_recover_dmesg_data(cxt);
>> +       if (ret)
>> +               goto recover_fail;
>> +
>> +       return 0;
>> +recover_fail:
>> +       pr_debug("recovery dmesg failed\n");
>> +       return ret;
>> +}
>> +
>> +static inline int blkz_recovery(struct blkoops_context *cxt)
>> +{
>> +       int ret = -EBUSY;
>> +
>> +       if (atomic_read(&cxt->recovery))
>> +               return 0;
>> +
>> +       if (!is_blkdev_up())
>> +               goto recover_fail;
>> +
>> +       ret = blkz_recover_dmesg(cxt);
>> +       if (ret)
>> +               goto recover_fail;
>> +
>> +       atomic_set(&cxt->recovery, 1);
>> +       pr_debug("recover end!\n");
>> +       return 0;
>> +
>> +recover_fail:
>> +       pr_debug("recovery failed, handle buffer\n");
>> +       return ret;
>> +}
>> +
>> +static int blkoops_pstore_open(struct pstore_info *psi)
>> +{
>> +       struct blkoops_context *cxt = psi->data;
>> +
>> +       cxt->dmesg_read_cnt = 0;
>> +       return 0;
>> +}
>> +
>> +static inline bool blkz_ok(struct blkz_zone *zone)
>> +{
>> +       if (!zone || !zone->buffer || !buffer_datalen(zone))
>> +               return false;
>> +       return true;
>> +}
>> +
>> +static int blkoops_pstore_erase(struct pstore_record *record)
>> +{
>> +       struct blkoops_context *cxt = record->psi->data;
>> +       struct blkz_zone *zone = NULL;
>> +
>> +       /*
>> +        * before read/erase, we must recover from storage.
>> +        * if recover failed, handle buffer
>> +        */
>> +       blkz_recovery(cxt);
>> +

pstore/blk should not do recovery on erase, so I will delete this call.
The reasons for doing recovery on read/write are explained below.

>> +       if (record->type == PSTORE_TYPE_DMESG)
>> +               zone = cxt->dbzs[record->id];
>> +       if (!blkz_ok(zone))
>> +               return 0;
>> +
>> +       atomic_set(&zone->buffer->datalen, 0);
>> +       return blkz_zone_write(zone, FLUSH_META, NULL, 0, 0);
>> +}
>> +
>> +static void blkoops_write_kmsg_hdr(struct blkz_zone *zone,
>> +               struct pstore_record *record)
>> +{
>> +       struct blkoops_context *cxt = record->psi->data;
>> +       struct blkz_buffer *buffer = zone->buffer;
>> +       struct blkz_dmesg_header *hdr =
>> +               (struct blkz_dmesg_header *)buffer->data;
>> +
>> +       hdr->magic = DMESG_HEADER_MAGIC;
>> +       hdr->compressed = record->compressed;
>> +       hdr->time.tv_sec = record->time.tv_sec;
>> +       hdr->time.tv_nsec = record->time.tv_nsec;
>> +       hdr->reason = record->reason;
>> +       if (hdr->reason == KMSG_DUMP_OOPS)
>> +               hdr->counter = ++cxt->oops_counter;
>> +       else
>> +               hdr->counter = ++cxt->panic_counter;
>> +}
>> +
>> +static int notrace blkz_dmesg_write(struct blkoops_context *cxt,
>> +               struct pstore_record *record)
>> +{
>> +       struct blkz_info *info = cxt->bzinfo;
>> +       struct blkz_zone *zone;
>> +       size_t size, hlen;
>> +
>> +       /*
>> +        * Out of the various dmesg dump types, ramoops is currently designed
> 
> Heh, copy/paste-o from ramoops. :) ramoops -> blkoops ?
> 

Sure.

>> +        * to only store crash logs, rather than storing general kernel logs.
>> +        */
>> +       if (record->reason != KMSG_DUMP_OOPS &&
>> +                       record->reason != KMSG_DUMP_PANIC)
>> +               return -EINVAL;
>> +
>> +       /* Skip Oopes when configured to do so. */
>> +       if (record->reason == KMSG_DUMP_OOPS && !info->dump_oops)
>> +               return -EINVAL;
>> +
>> +       /*
>> +        * Explicitly only take the first part of any new crash.
>> +        * If our buffer is larger than kmsg_bytes, this can never happen,
>> +        * and if our buffer is smaller than kmsg_bytes, we don't want the
>> +        * report split across multiple records.
>> +        */
>> +       if (record->part != 1)
>> +               return -ENOSPC;
>> +
>> +       if (!cxt->dbzs)
>> +               return -ENOSPC;
>> +
>> +       zone = cxt->dbzs[cxt->dmesg_write_cnt];
>> +       if (!zone)
>> +               return -ENOSPC;
>> +
>> +       blkoops_write_kmsg_hdr(zone, record);
>> +       hlen = sizeof(struct blkz_dmesg_header);
>> +       size = record->size;
>> +       if (size + hlen > zone->buffer_size)
>> +               size = zone->buffer_size - hlen;
>> +       blkz_zone_write(zone, FLUSH_ALL, record->buf, size, hlen);
>> +
>> +       pr_debug("write %s to zone id %d\n", zone->name, cxt->dmesg_write_cnt);
>> +       cxt->dmesg_write_cnt = (cxt->dmesg_write_cnt + 1) % cxt->dmesg_max_cnt;
>> +       return 0;
>> +}
>> +
>> +static int notrace blkoops_pstore_write(struct pstore_record *record)
>> +{
>> +       struct blkoops_context *cxt = record->psi->data;
>> +
>> +       if (record->type == PSTORE_TYPE_DMESG &&
>> +                       record->reason == KMSG_DUMP_PANIC)
>> +               atomic_set(&cxt->on_panic, 1);
>> +
>> +       /*
>> +        * before read/erase, we must recover from storage.
>> +        * if recover failed, handle buffer
>> +        */
>> +       blkz_recovery(cxt);
> 
> I don't understand this part: you're doing reads before the write? Can
> you clarify this a bit? (Or maybe, describe what "recovery" is? I
> thought it was just reading the blk records?)
> 

Recovery means reading the old data back from the block device.
Why does pstore/blk need it?
1. pstore/blk does NOT recover the old data at initialization time, because
the block device may not be ready yet, which is different from ram. So
pstore/blk does the recovery through the general read/write operations,
via the io stack, on write (kernel initialization has finished but pstore
is not mounted yet) and on read (when pstore is mounted).
2. pstore/blk must find the next zone to use for dmesg; we can't
recklessly overwrite the first zone. pstore/blk reads from the block
device and finds the next free zone or the oldest one. That is what
blkz_recovery -> blkz_recover_dmesg -> blkz_recover_dmesg_meta does.

When is the block device ready?
It means the io stack has finished initializing (at the end of kernel
initialization). The general read/write operations of pstore/blk, unlike
the panic ones, must not corrupt the io stack.
panic_read/panic_write may corrupt the io stack, which means the general
read/write will no longer work after the panic operations have run. So
pstore/blk uses panic_read/write for recovery only when on panic.
Besides, the general read/write through the io stack (via
kernel_read/kernel_write) can be applied to most block devices.
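
To make the split concrete, the panic pair a backend supplies might look
roughly like the sketch below; sunxi_raw_read()/sunxi_raw_write() are
hypothetical polled helpers that talk to the controller directly, with no
locks, no scheduling and no block layer (only the callback prototypes come
from this patch):

/*
 * Used only while on_panic is set: the io stack may already be broken,
 * so these must not sleep or go through the normal block-layer path.
 */
static ssize_t my_panic_read(char *buf, size_t bytes, loff_t pos)
{
	/* hypothetical: polled, lock-free read at partition offset 'pos' */
	return sunxi_raw_read(buf, bytes, pos);
}

static ssize_t my_panic_write(const char *buf, size_t bytes, loff_t pos)
{
	/* hypothetical: polled, lock-free write at partition offset 'pos' */
	return sunxi_raw_write(buf, bytes, pos);
}

The general .read/.write can stay NULL and fall back to the
kernel_read()/kernel_write() based defaults, since those are only called
once the io stack is up.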

>> +
>> +       switch (record->type) {
>> +       case PSTORE_TYPE_DMESG:
>> +               return blkz_dmesg_write(cxt, record);
>> +       default:
>> +               return -EINVAL;
>> +       }
>> +}
>> +
>> +#define READ_NEXT_ZONE ((ssize_t)(-1024))
>> +static struct blkz_zone *blkz_read_next_zone(struct blkoops_context *cxt)
>> +{
>> +       struct blkz_zone *zone = NULL;
>> +
>> +       while (cxt->dmesg_read_cnt < cxt->dmesg_max_cnt) {
>> +               zone = cxt->dbzs[cxt->dmesg_read_cnt++];
>> +               if (blkz_ok(zone))
>> +                       return zone;
>> +       }
>> +
>> +       return NULL;
>> +}
>> +
>> +static int blkoops_read_dmesg_hdr(struct blkz_zone *zone,
>> +               struct pstore_record *record)
>> +{
>> +       struct blkz_buffer *buffer = zone->buffer;
>> +       struct blkz_dmesg_header *hdr =
>> +               (struct blkz_dmesg_header *)buffer->data;
>> +
>> +       if (hdr->magic != DMESG_HEADER_MAGIC)
>> +               return -EINVAL;
>> +       record->compressed = hdr->compressed;
>> +       record->time.tv_sec = hdr->time.tv_sec;
>> +       record->time.tv_nsec = hdr->time.tv_nsec;
>> +       record->reason = hdr->reason;
>> +       record->count = hdr->counter;
>> +       return 0;
>> +}
>> +
>> +static ssize_t blkz_dmesg_read(struct blkz_zone *zone,
>> +               struct pstore_record *record)
>> +{
>> +       size_t size, hlen = 0;
>> +
>> +       size = buffer_datalen(zone);
>> +       /* Clear and skip this DMESG record if it has no valid header */
>> +       if (blkoops_read_dmesg_hdr(zone, record)) {
>> +               atomic_set(&zone->buffer->datalen, 0);
>> +               atomic_set(&zone->dirty, 0);
>> +               return READ_NEXT_ZONE;
>> +       }
>> +       size -= sizeof(struct blkz_dmesg_header);
>> +
>> +       if (!record->compressed) {
>> +               char *buf = kasprintf(GFP_KERNEL,
>> +                               "blkoops: %s: Total %d times\n",
>> +                               record->reason == KMSG_DUMP_OOPS ? "Oops" :
>> +                               "Panic", record->count);
>> +               hlen = strlen(buf);
>> +               record->buf = krealloc(buf, hlen + size, GFP_KERNEL);
>> +               if (!record->buf) {
>> +                       kfree(buf);
>> +                       return -ENOMEM;
>> +               }
>> +       } else {
>> +               record->buf = kmalloc(size, GFP_KERNEL);
>> +               if (!record->buf)
>> +                       return -ENOMEM;
>> +       }
>> +
>> +       if (unlikely(blkz_zone_read(zone, record->buf + hlen, size,
>> +                               sizeof(struct blkz_dmesg_header)) < 0)) {
>> +               kfree(record->buf);
>> +               return READ_NEXT_ZONE;
>> +       }
>> +
>> +       return size + hlen;
>> +}
>> +
>> +static ssize_t blkoops_pstore_read(struct pstore_record *record)
>> +{
>> +       struct blkoops_context *cxt = record->psi->data;
>> +       ssize_t (*blkz_read)(struct blkz_zone *zone,
>> +                       struct pstore_record *record);
>> +       struct blkz_zone *zone;
>> +       ssize_t ret;
>> +
>> +       /*
>> +        * before read/erase, we must recover from storage.
>> +        * if recover failed, handle buffer
>> +        */
>> +       blkz_recovery(cxt);
>> +
>> +next_zone:
>> +       zone = blkz_read_next_zone(cxt);
>> +       if (!zone)
>> +               return 0;
>> +
>> +       record->id = 0;
>> +       record->type = zone->type;
>> +
>> +       record->time.tv_sec = 0;
>> +       record->time.tv_nsec = 0;
>> +       record->compressed = false;
> 
> None of the zeroing is needed: platform.c already zero-initializes,
> populates record->psi and record->time.
> 

Yes, you are right. It was copied from ram.c (ramoops_pstore_read). Maybe
ramoops should delete it too? Besides, pstore zero-initializes the record
twice: it allocates the memory with kzalloc in pstore_get_backend_records,
then zero-initializes it again in pstore_record_init.

>> +
>> +       switch (record->type) {
>> +       case PSTORE_TYPE_DMESG:
>> +               blkz_read = blkz_dmesg_read;
>> +               record->id = cxt->dmesg_read_cnt - 1;
>> +               break;
>> +       default:
>> +               goto next_zone;
>> +       }
>> +
>> +       ret = blkz_read(zone, record);
>> +       if (ret == READ_NEXT_ZONE)
>> +               goto next_zone;
>> +       return ret;
>> +}
>> +
>> +static struct blkoops_context blkz_cxt = {
>> +       .bzinfo_lock = __SPIN_LOCK_UNLOCKED(blkz_cxt.bzinfo_lock),
>> +       .blkdev_up = ATOMIC_INIT(0),
>> +       .recovery = ATOMIC_INIT(0),
>> +       .on_panic = ATOMIC_INIT(0),
>> +       .pstore = {
>> +               .owner = THIS_MODULE,
>> +               .name = "blkoops",
>> +               .open = blkoops_pstore_open,
>> +               .read = blkoops_pstore_read,
>> +               .write = blkoops_pstore_write,
>> +               .erase = blkoops_pstore_erase,
>> +       },
>> +};
>> +
>> +static ssize_t blkz_sample_read(char *buf, size_t bytes, loff_t pos)
>> +{
>> +       struct blkoops_context *cxt = &blkz_cxt;
>> +       const char *devpath = cxt->bzinfo->part_path;
>> +       struct file *filp;
>> +       ssize_t ret;
>> +
>> +       if (!devpath)
>> +               return -EINVAL;
>> +
>> +       if (!is_blkdev_up())
>> +               return -EBUSY;
>> +
>> +       filp = filp_open(devpath, O_RDONLY, 0);
>> +       if (IS_ERR(filp)) {
>> +               pr_debug("open %s failed, maybe unready\n", devpath);
>> +               return -EACCES;
>> +       }
>> +       ret = kernel_read(filp, buf, bytes, &pos);
>> +       filp_close(filp, NULL);
>> +
>> +       return ret;
>> +}
>> +
>> +static ssize_t blkz_sample_write(const char *buf, size_t bytes, loff_t pos)
>> +{
>> +       struct blkoops_context *cxt = &blkz_cxt;
>> +       const char *devpath = cxt->bzinfo->part_path;
>> +       struct file *filp;
>> +       ssize_t ret;
>> +
>> +       if (!devpath)
>> +               return -EINVAL;
>> +
>> +       if (!is_blkdev_up())
>> +               return -EBUSY;
>> +
>> +       filp = filp_open(devpath, O_WRONLY, 0);
>> +       if (IS_ERR(filp)) {
>> +               pr_debug("open %s failed, maybe unready\n", devpath);
>> +               return -EACCES;
>> +       }
>> +       ret = kernel_write(filp, buf, bytes, &pos);
>> +       vfs_fsync(filp, 0);
>> +       filp_close(filp, NULL);
>> +
>> +       return ret;
>> +}
>> +
>> +static struct blkz_zone *blkz_init_zone(enum pstore_type_id type,
>> +               unsigned long *off, size_t size)
>> +{
>> +       struct blkz_info *info = blkz_cxt.bzinfo;
>> +       struct blkz_zone *zone;
>> +       const char *name = pstore_type_to_name(type);
>> +
>> +       if (!size)
>> +               return NULL;
>> +
>> +       if (*off + size > info->part_size) {
>> +               pr_err("no room for %s (0x%zx@...lx over 0x%lx)\n",
>> +                       name, size, *off, info->part_size);
>> +               return ERR_PTR(-ENOMEM);
>> +       }
>> +
>> +       zone = kzalloc(sizeof(struct blkz_zone), GFP_KERNEL);
>> +       if (!zone)
>> +               return ERR_PTR(-ENOMEM);
>> +
>> +       /**
>> +        * NOTE: allocate buffer for blk zones for two reasons:
>> +        * 1. It can temporarily hold the data before sample_read/write
>> +        *    are useable.
>> +        * 2. It makes pstore usable even if no persistent storage. Most
>> +        *    events of pstore except panic are suitable!!
>> +        */
>> +       zone->buffer = kmalloc(size, GFP_KERNEL);
>> +       if (!zone->buffer) {
>> +               kfree(zone);
>> +               return ERR_PTR(-ENOMEM);
>> +       }
>> +       memset(zone->buffer, 0xFF, size);
>> +       zone->off = *off;
>> +       zone->name = name;
>> +       zone->type = type;
>> +       zone->buffer_size = size - sizeof(struct blkz_buffer);
>> +       zone->buffer->sig = type ^ BLK_SIG;
>> +       atomic_set(&zone->dirty, 0);
>> +       atomic_set(&zone->buffer->datalen, 0);
>> +
>> +       *off += size;
>> +
>> +       pr_debug("blkzone %s: off 0x%lx, %zu header, %zu data\n", zone->name,
>> +                       zone->off, sizeof(*zone->buffer), zone->buffer_size);
>> +       return zone;
>> +}
>> +
>> +static struct blkz_zone **blkz_init_zones(enum pstore_type_id type,
>> +       unsigned long *off, size_t total_size, ssize_t record_size,
>> +       unsigned int *cnt)
>> +{
>> +       struct blkz_info *info = blkz_cxt.bzinfo;
>> +       struct blkz_zone **zones, *zone;
>> +       const char *name = pstore_type_to_name(type);
>> +       int c, i;
>> +
>> +       if (!total_size || !record_size)
>> +               return NULL;
>> +
>> +       if (*off + total_size > info->part_size) {
>> +               pr_err("no room for zones %s (0x%zx@...lx over 0x%lx)\n",
>> +                       name, total_size, *off, info->part_size);
>> +               return ERR_PTR(-ENOMEM);
>> +       }
>> +
>> +       c = total_size / record_size;
>> +       zones = kcalloc(c, sizeof(*zones), GFP_KERNEL);
>> +       if (!zones) {
>> +               pr_err("allocate for zones %s failed\n", name);
>> +               return ERR_PTR(-ENOMEM);
>> +       }
>> +       memset(zones, 0, c * sizeof(*zones));
>> +
>> +       for (i = 0; i < c; i++) {
>> +               zone = blkz_init_zone(type, off, record_size);
>> +               if (!zone || IS_ERR(zone)) {
>> +                       pr_err("initialize zones %s failed\n", name);
>> +                       while (--i >= 0)
>> +                               kfree(zones[i]);
>> +                       kfree(zones);
>> +                       return (void *)zone;
>> +               }
>> +               zones[i] = zone;
>> +       }
>> +
>> +       *cnt = c;
>> +       return zones;
>> +}
>> +
>> +static void blkz_free_zone(struct blkz_zone **blkzone)
>> +{
>> +       struct blkz_zone *zone = *blkzone;
>> +
>> +       if (!zone)
>> +               return;
>> +
>> +       kfree(zone->buffer);
>> +       kfree(zone);
>> +       *blkzone = NULL;
>> +}
>> +
>> +static void blkz_free_zones(struct blkz_zone ***blkzones, unsigned int *cnt)
>> +{
>> +       struct blkz_zone **zones = *blkzones;
>> +
>> +       while (*cnt > 0) {
>> +               blkz_free_zone(&zones[*cnt]);
>> +               (*cnt)--;
>> +       }
>> +       kfree(zones);
>> +       *blkzones = NULL;
>> +}
>> +
>> +static int blkz_cut_zones(struct blkoops_context *cxt)
>> +{
>> +       struct blkz_info *info = cxt->bzinfo;
>> +       unsigned long off = 0;
>> +       int err;
>> +       size_t size;
>> +
>> +       size = info->part_size;
>> +       cxt->dbzs = blkz_init_zones(PSTORE_TYPE_DMESG, &off, size,
>> +                       info->dmesg_size, &cxt->dmesg_max_cnt);
>> +       if (IS_ERR(cxt->dbzs)) {
>> +               err = PTR_ERR(cxt->dbzs);
>> +               goto fail_out;
>> +       }
>> +
>> +       return 0;
>> +fail_out:
>> +       return err;
>> +}
>> +
>> +int blkz_register(struct blkz_info *info)
>> +{
>> +       int err = -EINVAL;
>> +       struct blkoops_context *cxt = &blkz_cxt;
>> +       struct module *owner = info->owner;
>> +
>> +       if (!info->part_size || !info->dmesg_size) {
>> +               pr_warn("The memory size and the dmesg size must be non-zero\n");
>> +               return -EINVAL;
>> +       }
>> +
>> +       if (info->part_size < 4096) {
>> +               pr_err("partition size must be over 4096 bytes\n");
>> +               return -EINVAL;
>> +       }
>> +
>> +#define check_size(name, size) {                                       \
>> +               if (info->name & (size)) {                              \
>> +                       pr_err(#name " must be a multiple of %d\n",     \
>> +                                       (size));                        \
>> +                       return -EINVAL;                                 \
>> +               }                                                       \
>> +       }
>> +
>> +       check_size(part_size, 4096 - 1);
>> +       check_size(dmesg_size, SECTOR_SIZE - 1);
>> +
>> +#undef check_size
>> +
>> +       if (!info->read)
>> +               info->read = blkz_sample_read;
>> +       if (!info->write)
>> +               info->write = blkz_sample_write;
> 
> Can you explain the write vs panic_write difference? And "sample"
> seems unusual here -- maybe "default"? 

I will rename the functions to "blkz_default_general_read/write".
I have explained the difference between them above.

> Are there other handlers you imagine that wouldn't want to use these
> sample handlers?

Yes: some devices do not want to do the general read/write through the io
stack, so those drivers implement the general read/write themselves. For
example, if developers think going through the complete io stack is too
slow, they can do the I/O directly in the device driver. Or some
developers reserve a piece of storage that is never registered as a block
device with the kernel, to keep it from being damaged by anything else;
such a reserved area cannot be reached by the default handlers. A rough
sketch of that case follows.
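
A sketch of that second case, assuming a driver that keeps a reserved raw
area outside any registered block partition; resvd_area_read(),
resvd_area_write() and RESVD_AREA_SIZE are hypothetical, only the
blkz_info fields come from this patch:

/* general (non-panic) handlers that never touch the io stack */
static ssize_t my_read(char *buf, size_t bytes, loff_t pos)
{
	/* hypothetical: copy from the reserved raw area of the device */
	return resvd_area_read(buf, bytes, pos);
}

static ssize_t my_write(const char *buf, size_t bytes, loff_t pos)
{
	/* hypothetical: program the reserved raw area of the device */
	return resvd_area_write(buf, bytes, pos);
}

static struct blkz_info my_blkz_info = {
	.owner		= THIS_MODULE,
	.name		= "myblk-resvd",
	.part_path	= NULL,			/* no block partition involved */
	.part_size	= RESVD_AREA_SIZE,	/* hypothetical, multiple of 4096 */
	.dmesg_size	= 64 * 1024,
	.dump_oops	= 1,
	.read		= my_read,		/* override the default handlers */
	.write		= my_write,
};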

>> +
>> +       if (owner && !try_module_get(owner))
>> +               return -EINVAL;
>> +
>> +       spin_lock(&cxt->bzinfo_lock);
>> +       if (cxt->bzinfo) {
>> +               pr_warn("blk '%s' already loaded: ignoring '%s'\n",
>> +                               cxt->bzinfo->name, info->name);
>> +               spin_unlock(&cxt->bzinfo_lock);
>> +               return -EBUSY;
>> +       }
>> +       cxt->bzinfo = info;
>> +       spin_unlock(&cxt->bzinfo_lock);
>> +
>> +       if (blkz_cut_zones(cxt)) {
>> +               pr_err("cut zones fialed\n");
>> +               goto fail_out;
>> +       }
>> +
>> +       cxt->pstore.bufsize = cxt->dbzs[0]->buffer_size -
>> +                       sizeof(struct blkz_dmesg_header);
>> +       cxt->pstore.buf = kzalloc(cxt->pstore.bufsize, GFP_KERNEL);
>> +       if (!cxt->pstore.buf) {
>> +               pr_err("cannot allocate pstore crash dump buffer\n");
>> +               err = -ENOMEM;
>> +               goto fail_out;
>> +       }
>> +       cxt->pstore.data = cxt;
>> +       cxt->pstore.flags = PSTORE_FLAGS_DMESG;
>> +
>> +       err = pstore_register(&cxt->pstore);
>> +       if (err) {
>> +               pr_err("registering with pstore failed\n");
>> +               goto free_pstore_buf;
>> +       }
>> +
>> +       pr_info("Registered %s as blkzone backend for %s%s\n", info->name,
>> +                       cxt->dbzs && cxt->bzinfo->dump_oops ? "Oops " : "",
>> +                       cxt->dbzs && cxt->bzinfo->panic_write ? "Panic " : "");
>> +
> 
> So if panic_write isn't registered, panics will be dropped?
> 

Yes. The general write is not suitable for panic; that is why there are
two kinds of block operations, general and panic.

>> +       module_put(owner);
>> +       return 0;
>> +
>> +free_pstore_buf:
>> +       kfree(cxt->pstore.buf);
>> +fail_out:
>> +       spin_lock(&blkz_cxt.bzinfo_lock);
>> +       blkz_cxt.bzinfo = NULL;
>> +       spin_unlock(&blkz_cxt.bzinfo_lock);
>> +       return err;
>> +}
>> +EXPORT_SYMBOL_GPL(blkz_register);
>> +
>> +void blkz_unregister(struct blkz_info *info)
>> +{
>> +       struct blkoops_context *cxt = &blkz_cxt;
>> +
>> +       pstore_unregister(&cxt->pstore);
>> +       kfree(cxt->pstore.buf);
>> +       cxt->pstore.bufsize = 0;
>> +
>> +       spin_lock(&cxt->bzinfo_lock);
>> +       blkz_cxt.bzinfo = NULL;
>> +       spin_unlock(&cxt->bzinfo_lock);
>> +
>> +       blkz_free_zones(&cxt->dbzs, &cxt->dmesg_max_cnt);
>> +
>> +}
>> +EXPORT_SYMBOL_GPL(blkz_unregister);
>> +
>> +MODULE_LICENSE("GPL");
>> +MODULE_AUTHOR("liaoweixiong <liaoweixiong@...winnertech.com>");
>> +MODULE_DESCRIPTION("Block device Oops/Panic logger");
>> diff --git a/include/linux/pstore_blk.h b/include/linux/pstore_blk.h
>> new file mode 100644
>> index 0000000..426cae4
>> --- /dev/null
>> +++ b/include/linux/pstore_blk.h
> 
> Is there any reason for this file to live in include/linux? I think it
> can just be in fs/pstore/ as blkzone.h or something?
> 

It must live in include/linux because block device drivers need it to
register with pstore/blk. blkz_register is called by block drivers rather
than by the filesystem, which means the block devices have already
initialized and are ready for pstore/blk.

>> @@ -0,0 +1,61 @@
>> +/* SPDX-License-Identifier: GPL-2.0 */
>> +
>> +#ifndef __PSTORE_BLK_H_
>> +#define __PSTORE_BLK_H_
>> +
>> +#include <linux/types.h>
>> +#include <linux/blkdev.h>
>> +
>> +#ifndef SECTOR_SIZE
>> +#define SECTOR_SIZE 512
>> +#endif
>> +
>> +/**
>> + * struct blkz_info - backend blkzone driver structure
>> + *
>> + * @owner:
>> + *     module which is responsible for this backend driver
>> + * @name:
>> + *     name of the backend driver
>> + * @part_path:
>> + *     path of a storage partition. It's ok to keep it as NULL
>> + *     if you passing @read and @write in blkz_info. @part_path
>> + *     is needed by stoz_simple_read/write. If both of @part_path,
> 
> stoz -> blkz ?
> 

Fixed.

>> + *     @read and @write are NULL, it will temporarity hold the data
>> + *     in buffer allocated by 'vmalloc'.
>> + * @part_size:
>> + *     total size of a storage partition in bytes. The partition
>> + *     will be used to save data of pstore.
>> + * @dmesg_size:
>> + *     the size of each zones for dmesg (oops & panic).
>> + * @dump_oops:
>> + *     dump oops and panic log or only panic.
>> + * @read:
>> + *     the normal (not panic) read operation. If NULL, replaced as
>> + *     stoz_sample_read. See also @part_path
>> + * @write:
>> + *     the normal (not panic) write operation. If NULL, replaced as
>> + *     stoz_sample_write. See also @part_path
> 
> Both as above?
> 
>> + * @panic_read:
>> + *     the read operation only used for panic.
>> + * @panic_write:
>> + *     the write operation only used for panic.
>> + */
>> +struct blkz_info {
>> +       struct module *owner;
>> +       const char *name;
>> +
>> +       const char *part_path;
>> +       unsigned long part_size;
>> +       unsigned long dmesg_size;
>> +       int dump_oops;
>> +       ssize_t (*read)(char *buf, size_t bytes, loff_t pos);
>> +       ssize_t (*write)(const char *buf, size_t bytes, loff_t pos);
>> +       ssize_t (*panic_read)(char *buf, size_t bytes, loff_t pos);
>> +       ssize_t (*panic_write)(const char *buf, size_t bytes, loff_t pos);
>> +};
>> +
>> +extern int blkz_register(struct blkz_info *info);
>> +extern void blkz_unregister(struct blkz_info *info);
>> +
>> +#endif
>> --
>> 1.9.1
>>
> 
> Slowly making my way through the series... :)
> 
Thank you for your careful review.
