linux-kernel - Re: [PATCH v1 2/2] zram: remove init_lock in zram_make

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20150130144145.GA2840@blaptop>
Date:	Fri, 30 Jan 2015 23:41:56 +0900
From:	Minchan Kim <minchan@...nel.org>
To:	Sergey Senozhatsky <sergey.senozhatsky.work@...il.com>
Cc:	Sergey Senozhatsky <sergey.senozhatsky@...il.com>,
	Andrew Morton <akpm@...ux-foundation.org>,
	"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
	Linux-MM <linux-mm@...ck.org>, Nitin Gupta <ngupta@...are.org>,
	Jerome Marchand <jmarchan@...hat.com>,
	Ganesh Mahendran <opensource.ganesh@...il.com>
Subject: Re: [PATCH v1 2/2] zram: remove init_lock in zram_make_request

Hello, Sergey

On Thu, Jan 29, 2015 at 04:08:35PM +0900, Sergey Senozhatsky wrote:
> On (01/29/15 15:35), Minchan Kim wrote:
> >
> > As you told, the data was not stable.
> >
> yes. fread test was always slower, and the rest was mostly slower.
> 
> 
> > Anyway, when I read down_read implementation, it's one atomic instruction.
> > Hmm, it seems te be better for srcu_read_lock which does more things.
> >
> srcu looks havier, agree.

ffffffff8172c350 <down_read>:
ffffffff8172c350:       e8 7b 3f 00 00          callq  ffffffff817302d0 <__fentry__>
ffffffff8172c355:       55                      push   %rbp
ffffffff8172c356:       48 89 e5                mov    %rsp,%rbp
ffffffff8172c359:       53                      push   %rbx
ffffffff8172c35a:       48 89 fb                mov    %rdi,%rbx
ffffffff8172c35d:       48 83 ec 08             sub    $0x8,%rsp
ffffffff8172c361:       e8 9a e0 ff ff          callq  ffffffff8172a400 <_cond_resched>
ffffffff8172c366:       48 89 d8                mov    %rbx,%rax
ffffffff8172c369:       f0 48 ff 00             lock incq (%rax)
ffffffff8172c36d:       79 05                   jns    ffffffff8172c374 <down_read+0x24>
ffffffff8172c36f:       e8 5c e7 c4 ff          callq  ffffffff8137aad0 <call_rwsem_down_read_failed>
ffffffff8172c374:       48 83 c4 08             add    $0x8,%rsp
ffffffff8172c378:       5b                      pop    %rbx
ffffffff8172c379:       5d                      pop    %rbp
ffffffff8172c37a:       c3                      retq   


ffffffff810eeec0 <__srcu_read_lock>:
ffffffff810eeec0:       e8 0b 14 64 00          callq  ffffffff817302d0 <__fentry__>
ffffffff810eeec5:       48 8b 07                mov    (%rdi),%rax
ffffffff810eeec8:       55                      push   %rbp
ffffffff810eeec9:       48 89 e5                mov    %rsp,%rbp
ffffffff810eeecc:       83 e0 01                and    $0x1,%eax
ffffffff810eeecf:       48 63 d0                movslq %eax,%rdx
ffffffff810eeed2:       48 8b 4f 08             mov    0x8(%rdi),%rcx
ffffffff810eeed6:       65 48 ff 04 d1          incq   %gs:(%rcx,%rdx,8)
ffffffff810eeedb:       0f ae f0                mfence 
ffffffff810eeede:       48 83 c2 02             add    $0x2,%rdx
ffffffff810eeee2:       48 8b 4f 08             mov    0x8(%rdi),%rcx
ffffffff810eeee6:       65 48 ff 04 d1          incq   %gs:(%rcx,%rdx,8)
ffffffff810eeeeb:       5d                      pop    %rbp
ffffffff810eeeec:       c3                      retq   

Yes, __srcu_read_lock is a little bit heavier but the number of instruction
are not too much difference to make difference 10%. A culprit is
__cond_resched but I don't think, either because our test was CPU intensive
soS I don't think schedule latency affects total bandwidth.

More cuprit is your data pattern.
It seems you didn't use scramble_buffers=0, zero_buffers in fio so that
fio fills random data pattern so zram bandwidth could be different by
compression/decompression ratio.

I did test your fio script adding above options with my 4 CPU real machine
(NOTE, ubuntu fio is old so that it doesn't work well above two options
so I should update fio recently which solves it perfectly)

Another thing about fio is it seems loops option works with write test
with overwrite=1 options while read test doesn't work so that I should
use perf stat -r options to verify stdev.

In addition, I passed first test to remove noise as creating files
and increased testsize as 1G from 400m

1) randread

= vanilla =

 Performance counter stats for 'fio test-fio-randread.txt' (10 runs):

       4713.879241      task-clock (msec)         #    3.160 CPUs utilized            ( +-  0.62% )
             1,131      context-switches          #    0.240 K/sec                    ( +-  2.83% )
                23      cpu-migrations            #    0.005 K/sec                    ( +-  4.40% )
            15,767      page-faults               #    0.003 M/sec                    ( +-  0.03% )
    15,134,497,088      cycles                    #    3.211 GHz                      ( +-  0.15% ) [83.36%]
    10,763,665,604      stalled-cycles-frontend   #   71.12% frontend cycles idle     ( +-  0.22% ) [83.34%]
     6,896,294,076      stalled-cycles-backend    #   45.57% backend  cycles idle     ( +-  0.29% ) [66.67%]
     9,898,608,791      instructions              #    0.65  insns per cycle        
                                                  #    1.09  stalled cycles per insn  ( +-  0.07% ) [83.33%]
     1,852,167,485      branches                  #  392.918 M/sec                    ( +-  0.07% ) [83.34%]
        14,864,143      branch-misses             #    0.80% of all branches          ( +-  0.16% ) [83.34%]

       1.491813361 seconds time elapsed                                          ( +-  0.62% )

= srcu =

 Performance counter stats for 'fio test-fio-randread.txt' (10 runs):

       4752.790715      task-clock (msec)         #    3.166 CPUs utilized            ( +-  0.48% )
             1,179      context-switches          #    0.248 K/sec                    ( +-  1.56% )
                26      cpu-migrations            #    0.005 K/sec                    ( +-  3.91% )
            15,764      page-faults               #    0.003 M/sec                    ( +-  0.02% )
    15,263,869,915      cycles                    #    3.212 GHz                      ( +-  0.25% ) [83.32%]
    10,935,658,177      stalled-cycles-frontend   #   71.64% frontend cycles idle     ( +-  0.38% ) [83.33%]
     7,067,290,320      stalled-cycles-backend    #   46.30% backend  cycles idle     ( +-  0.46% ) [66.64%]
     9,896,513,423      instructions              #    0.65  insns per cycle        
                                                  #    1.11  stalled cycles per insn  ( +-  0.07% ) [83.33%]
     1,847,612,285      branches                  #  388.743 M/sec                    ( +-  0.07% ) [83.38%]
        14,814,815      branch-misses             #    0.80% of all branches          ( +-  0.24% ) [83.37%]

       1.501284082 seconds time elapsed                                          ( +-  0.50% )

srcu is worse as 0.63% but the difference is really marginal.

2) randwrite

= vanilla =

 Performance counter stats for 'fio test-fio-randwrite.txt' (10 runs):

       6283.823490      task-clock (msec)         #    3.332 CPUs utilized            ( +-  0.44% )
             1,536      context-switches          #    0.245 K/sec                    ( +-  2.10% )
                25      cpu-migrations            #    0.004 K/sec                    ( +-  3.79% )
            15,914      page-faults               #    0.003 M/sec                    ( +-  0.02% )
    20,408,942,915      cycles                    #    3.248 GHz                      ( +-  0.40% ) [83.34%]
    14,398,424,739      stalled-cycles-frontend   #   70.55% frontend cycles idle     ( +-  0.62% ) [83.36%]
     9,513,822,555      stalled-cycles-backend    #   46.62% backend  cycles idle     ( +-  0.62% ) [66.65%]
    13,507,376,783      instructions              #    0.66  insns per cycle        
                                                  #    1.07  stalled cycles per insn  ( +-  0.05% ) [83.36%]
     3,155,423,934      branches                  #  502.150 M/sec                    ( +-  0.05% ) [83.34%]
        18,381,090      branch-misses             #    0.58% of all branches          ( +-  0.16% ) [83.34%]

       1.885926070 seconds time elapsed                                          ( +-  0.61% )

= srcu =

 Performance counter stats for 'fio test-fio-randwrite.txt' (10 runs):

       6152.997119      task-clock (msec)         #    3.304 CPUs utilized            ( +-  0.29% )
             1,422      context-switches          #    0.231 K/sec                    ( +-  3.45% )
                28      cpu-migrations            #    0.004 K/sec                    ( +-  7.47% )
            15,921      page-faults               #    0.003 M/sec                    ( +-  0.02% )
    19,862,315,430      cycles                    #    3.228 GHz                      ( +-  0.09% ) [83.33%]
    13,872,541,761      stalled-cycles-frontend   #   69.84% frontend cycles idle     ( +-  0.12% ) [83.34%]
     9,074,883,552      stalled-cycles-backend    #   45.69% backend  cycles idle     ( +-  0.19% ) [66.71%]
    13,494,854,651      instructions              #    0.68  insns per cycle        
                                                  #    1.03  stalled cycles per insn  ( +-  0.03% ) [83.37%]
     3,148,938,955      branches                  #  511.773 M/sec                    ( +-  0.04% ) [83.33%]
        17,701,249      branch-misses             #    0.56% of all branches          ( +-  0.23% ) [83.34%]

       1.862543230 seconds time elapsed                                          ( +-  0.35% )

srcu is better as 1.24% is better.

3) randrw

= vanilla =

 Performance counter stats for 'fio test-fio-randrw.txt' (10 runs):

       5609.976477      task-clock (msec)         #    3.249 CPUs utilized            ( +-  0.34% )
             1,407      context-switches          #    0.251 K/sec                    ( +-  0.96% )
                25      cpu-migrations            #    0.004 K/sec                    ( +-  5.37% )
            15,906      page-faults               #    0.003 M/sec                    ( +-  0.05% )
    18,090,560,346      cycles                    #    3.225 GHz                      ( +-  0.35% ) [83.36%]
    12,885,393,954      stalled-cycles-frontend   #   71.23% frontend cycles idle     ( +-  0.53% ) [83.33%]
     8,570,185,547      stalled-cycles-backend    #   47.37% backend  cycles idle     ( +-  0.59% ) [66.67%]
    11,771,620,352      instructions              #    0.65  insns per cycle        
                                                  #    1.09  stalled cycles per insn  ( +-  0.05% ) [83.35%]
     2,508,014,871      branches                  #  447.063 M/sec                    ( +-  0.05% ) [83.34%]
        18,585,638      branch-misses             #    0.74% of all branches          ( +-  0.23% ) [83.35%]

       1.726691239 seconds time elapsed                                          ( +-  0.40% )

= srcu =

       5475.312828      task-clock (msec)         #    3.246 CPUs utilized            ( +-  0.59% )
             1,399      context-switches          #    0.255 K/sec                    ( +-  1.46% )
                24      cpu-migrations            #    0.004 K/sec                    ( +-  6.27% )
            15,916      page-faults               #    0.003 M/sec                    ( +-  0.04% )
    17,583,197,041      cycles                    #    3.211 GHz                      ( +-  0.11% ) [83.33%]
    12,352,657,985      stalled-cycles-frontend   #   70.25% frontend cycles idle     ( +-  0.16% ) [83.33%]
     8,173,164,212      stalled-cycles-backend    #   46.48% backend  cycles idle     ( +-  0.19% ) [66.70%]
    11,780,176,340      instructions              #    0.67  insns per cycle        
                                                  #    1.05  stalled cycles per insn  ( +-  0.05% ) [83.36%]
     2,506,722,383      branches                  #  457.823 M/sec                    ( +-  0.06% ) [83.35%]
        18,436,877      branch-misses             #    0.74% of all branches          ( +-  0.18% ) [83.32%]

       1.686877512 seconds time elapsed                                          ( +-  0.43% )

srcu is better as 2.3%

Srcu is better than down_read but I don't believe either because when I
did perf record, [up|down]_read and srcu_read_[lock|unlock] is really
minor (about 0.5%) so that I think it's really marginal.
(for example, if we removes srcu_read_[un]lock totally, we just enhance
about 1%) So, I don't think it's worth.

Okay, if you concerns on the data still, how about this?
Even, it would be smaller instructions than [up|down]_read so I guess
it could remove your performance concern. But I don't believe
it could make significant difference, either.
Hope it addresses your concern.

Thanks.

>From e3f0965e692a3d085bb4ff25a774291e3f269550 Mon Sep 17 00:00:00 2001
From: Minchan Kim <minchan@...nel.org>
Date: Fri, 30 Jan 2015 09:57:37 +0900
Subject: [PATCH] zram: remove init_lock in zram_make_request

Admin could reset zram during I/O operation going on so we have
used zram->init_lock as read-side lock in I/O path to prevent
sudden zram meta freeing.

However, the init_lock is really troublesome.
We can't do call zram_meta_alloc under init_lock due to lockdep splat
because zram_rw_page is one of the function under reclaim path and
hold it as read_lock while other places in process context hold it
as write_lock. So, we have used allocation out of the lock to avoid
lockdep warn but it's not good for readability and fainally, I met
another lockdep splat between init_lock and cpu_hotplug from
kmem_cache_destroy during working zsmalloc compaction. :(

Yes, the ideal is to remove horrible init_lock of zram in rw path.
This patch removes it in rw path and instead, add atomic refcount
for meta lifetime management and completion to free meta in process
context. It's important to free meta in process context because
some of resource destruction needs mutex lock, which could be held
if we releases the resource in reclaim context so it's deadlock,
again.

Signed-off-by: Minchan Kim <minchan@...nel.org>
---
 drivers/block/zram/zram_drv.c | 72 +++++++++++++++++++++++++++++++------------
 drivers/block/zram/zram_drv.h |  2 ++
 2 files changed, 54 insertions(+), 20 deletions(-)

diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index aa5a4c54f057..9c69c35eace9 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -55,7 +55,7 @@ static DEVICE_ATTR_RO(name);
 
 static inline int init_done(struct zram *zram)
 {
-	return zram->meta != NULL;
+	return zram->disksize != 0;
 }
 
 static inline struct zram *dev_to_zram(struct device *dev)
@@ -350,6 +350,8 @@ static struct zram_meta *zram_meta_alloc(int device_id, u64 disksize)
 		goto out_error;
 	}
 
+	init_completion(&meta->complete);
+	atomic_set(&meta->refcount, 1);
 	return meta;
 
 out_error:
@@ -358,6 +360,23 @@ out_error:
 	return NULL;
 }
 
+static inline bool zram_meta_get(struct zram_meta *meta)
+{
+	if (!atomic_inc_not_zero(&meta->refcount))
+		return false;
+	return true;
+}
+
+/*
+ * We want to free zram_meta in process context to avoid
+ * deadlock between reclaim path and any other locks
+ */
+static inline void zram_meta_put(struct zram_meta *meta)
+{
+	if (atomic_dec_and_test(&meta->refcount))
+		complete(&meta->complete);
+}
+
 static void update_position(u32 *index, int *offset, struct bio_vec *bvec)
 {
 	if (*offset + bvec->bv_len >= PAGE_SIZE)
@@ -719,6 +738,9 @@ static void zram_bio_discard(struct zram *zram, u32 index,
 
 static void zram_reset_device(struct zram *zram, bool reset_capacity)
 {
+	struct zram_meta *meta;
+	u64 disksize;
+
 	down_write(&zram->init_lock);
 
 	zram->limit_pages = 0;
@@ -728,14 +750,20 @@ static void zram_reset_device(struct zram *zram, bool reset_capacity)
 		return;
 	}
 
+	meta = zram->meta;
+
 	zcomp_destroy(zram->comp);
 	zram->max_comp_streams = 1;
-	zram_meta_free(zram->meta, zram->disksize);
-	zram->meta = NULL;
+	disksize = zram->disksize;
+	zram_meta_put(meta);
+	/* Read/write handler will not handle further I/O operation. */
+	zram->disksize = 0;
+	wait_for_completion(&meta->complete);
+	/* I/O operation under all of CPU are done so let's free */
+	zram_meta_free(zram->meta, disksize);
 	/* Reset stats */
 	memset(&zram->stats, 0, sizeof(zram->stats));
 
-	zram->disksize = 0;
 	if (reset_capacity)
 		set_capacity(zram->disk, 0);
 
@@ -908,23 +936,25 @@ static void zram_make_request(struct request_queue *queue, struct bio *bio)
 {
 	struct zram *zram = queue->queuedata;
 
-	down_read(&zram->init_lock);
-	if (unlikely(!init_done(zram)))
+	if (unlikely(!zram_meta_get(zram->meta)))
 		goto error;
 
+	if (unlikely(!init_done(zram)))
+		goto put_meta;
+
 	if (!valid_io_request(zram, bio->bi_iter.bi_sector,
 					bio->bi_iter.bi_size)) {
 		atomic64_inc(&zram->stats.invalid_io);
-		goto error;
+		goto put_meta;
 	}
 
 	__zram_make_request(zram, bio);
-	up_read(&zram->init_lock);
+	zram_meta_put(zram->meta);
 
 	return;
-
+put_meta:
+	zram_meta_put(zram->meta);
 error:
-	up_read(&zram->init_lock);
 	bio_io_error(bio);
 }
 
@@ -946,21 +976,22 @@ static void zram_slot_free_notify(struct block_device *bdev,
 static int zram_rw_page(struct block_device *bdev, sector_t sector,
 		       struct page *page, int rw)
 {
-	int offset, err;
+	int offset, err = -EIO;
 	u32 index;
 	struct zram *zram;
 	struct bio_vec bv;
 
 	zram = bdev->bd_disk->private_data;
+	if (unlikely(!zram_meta_get(zram->meta)))
+		goto out;
+
+	if (unlikely(!init_done(zram)))
+		goto put_meta;
+
 	if (!valid_io_request(zram, sector, PAGE_SIZE)) {
 		atomic64_inc(&zram->stats.invalid_io);
-		return -EINVAL;
-	}
-
-	down_read(&zram->init_lock);
-	if (unlikely(!init_done(zram))) {
-		err = -EIO;
-		goto out_unlock;
+		err = -EINVAL;
+		goto put_meta;
 	}
 
 	index = sector >> SECTORS_PER_PAGE_SHIFT;
@@ -971,8 +1002,9 @@ static int zram_rw_page(struct block_device *bdev, sector_t sector,
 	bv.bv_offset = 0;
 
 	err = zram_bvec_rw(zram, &bv, index, offset, rw);
-out_unlock:
-	up_read(&zram->init_lock);
+put_meta:
+	zram_meta_put(zram->meta);
+out:
 	/*
 	 * If I/O fails, just return error(ie, non-zero) without
 	 * calling page_endio.
diff --git a/drivers/block/zram/zram_drv.h b/drivers/block/zram/zram_drv.h
index b05a816b09ac..07e55ff84a9c 100644
--- a/drivers/block/zram/zram_drv.h
+++ b/drivers/block/zram/zram_drv.h
@@ -96,6 +96,8 @@ struct zram_stats {
 struct zram_meta {
 	struct zram_table_entry *table;
 	struct zs_pool *mem_pool;
+	atomic_t refcount;
+	struct completion complete; /* notify IO under all of cpu are done */
 };
 
 struct zram {
-- 
1.9.1


> 
> > But I guessed most of overhead are from [de]compression, memcpy, clear_page
> > That's why I guessed we don't have measurable difference from that.
> > What's the data pattern if you use iozone?
> 
> by "data pattern" you mean usage scenario? well, I usually use zram for
> `make -jX', where X=[4..N]. so N concurrent read-write ops scenario.

What I meant is what data fills I/O buffer, which is really important
to evaluate zram because the compression/decompression speeds relys on it.

> 
> 	-ss
> 
> > I guess it's really simple pattern compressor can do fast. I used /dev/sda
> > for dd write so more realistic data. Anyway, if we has 10% regression even if
> > the data is simple, I never want to merge it.
> > I will test it carefully and if it turns out lots regression,
> > surely, I will not go with this and send the original patch again.

-- 
Kind regards,
Minchan Kim
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/