phc-discussions - Re: [PHC] CJK character sets

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite for Android: free password hash cracker in your pocket

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <20131127204136.GM24286@brightrain.aerifal.cx>
Date: Wed, 27 Nov 2013 15:41:36 -0500
From: Rich Felker <dalias@...ifal.cx>
To: discussions@...sword-hashing.net
Subject: Re: [PHC] CJK character sets

On Wed, Nov 27, 2013 at 08:17:44PM +0000, Marsh Ray wrote:
> http://en.wikipedia.org/wiki/Hanzi
> 
> Certain Asian "CJK" regions use character sets that are quite large.
> For example, in Japan college-entry level literacy includes over
> 2000 characters. PRC China officially uses a 'simplified' character
> set having 3500 and 7000 'frequently' and 'commonly' used
> characters. My understanding is that most of these characters
> represent semantic units similar to words.
> 
> Having fluency in an alphabet orders of magnitude larger than our
> tiny Western alphabets surely changes the password strength problem.
> I would expect that it would make it easier to create and remember
> strong entropy. A short 8-character password in a Western script
> could perhaps be more like a pass phrase in Chinese-based script.
> 
> Is there any available research on the entropy of passwords in
> typical use in these regions?
> 
> Do these regions present any special considerations (other than
> proper encoding) for the PHC?

I'm not aware of any such research, but anecdotally, my experience is
that a large number of East Asian users use 100%-decimal-digit
passwords. I suspect this is due to the risk of ideographic characters
in passwords not being accepted at all, or being misinterpreted due to
different encoding issues (some characters have lookalikes, and
sometimes which lookalike gets used depends on how the local legacy
encoding is converted to Unicode), and perhaps the fact that entering
these characters via an IME exposes them visually during entry.

Anyway, my point is that even if such research exists, it might not be
applicable if a large portion of users are still using all-digit
passwords.

Rich