Closing in on Unicode with Jcode

by why in inspect

Patrick Hall has a great article on using the Jcode module for Ruby, which provides more natural support for hacking Unicode strings. He has a few simple unit tests that illustrate failings in the Jcode library and leaves them right there for us to glare at.

 def test_reverse
   s = "Καλημέρα κόσμε!"
   srev = s.reverse
   assert_equal(s,srev) # fails
 end

 def test_index
   # String#index isn't Unicode-aware, it's counting bytes
   # there are ways around this, but...
   s = "Καλημέρα κόσμε!"
   assert_equal(0, s.index('Κ')) # passes
   assert_equal(1, s.index('α')) # fails!
   assert_equal(2, s.index('α')) # passes; 3rd byte!
 end
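
One of the ways around the byte counting is to compare code points instead of bytes. Here is a minimal sketch, assuming UTF-8 input and a single-character needle. The name `char_index` is made up for illustration; it is not part of Jcode or of Patrick's article.

```ruby
# Hypothetical helper (not from Jcode): find the character offset of a
# single character by comparing code points rather than raw bytes.
# Assumes both arguments are valid UTF-8.
def char_index(str, char)
  code_points = str.unpack('U*')     # one integer per character
  needle = char.unpack('U*').first   # code point of the character sought
  code_points.index(needle)          # nil if the character is absent
end

s = "Καλημέρα κόσμε!"
char_index(s, 'Κ')  #=> 0
char_index(s, 'α')  #=> 1
```

Unpacking with 'U*' decodes the whole string up front, so this is fine for short strings but probably not what you want in a tight loop.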

Sure, we’ll have all the answers in the future, but, for now, I’d say some patches to Jcode are in order. Or, to spirit up some Python mimicry:

 class UString < String
   # Show u-prefix as in Python
   def inspect; "u#{ super }" end

   # Count multibyte characters
   def length; self.scan(/./).length end

   # Reverse by character, returning a UString so inspect keeps the u-prefix
   def reverse; UString.new(self.scan(/./).reverse.join) end
 end

 module Kernel
   def u( str )
     UString.new str.gsub(/U\+([0-9a-fA-F]{4,4})/u){["#$1".hex ].pack('U*')}
   end
 end

 str = u"Ruby-語" 
 str.length   #=> 6
 str.reverse  #=> u"語-ybuR" 
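
The gsub inside `u` also turns any `U+XXXX` escape in the string into the character it names. A small sketch of just that step, pulled out into a standalone method; `expand_codepoints` is a hypothetical name used only here, and it handles four-digit escapes only, same as the regexp above.

```ruby
# Hypothetical standalone version of the substitution done inside Kernel#u:
# each "U+XXXX" (exactly four hex digits) becomes the UTF-8 character
# with that code point.
def expand_codepoints(str)
  str.gsub(/U\+([0-9a-fA-F]{4})/) { [$1.hex].pack('U*') }
end

expand_codepoints("Ruby-U+8A9E")  #=> "Ruby-語"
```

Text without a `U+` escape passes through untouched, so literal Unicode and escapes can be mixed in the same string.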

Anyway, Patrick’s blog is a great tour through easily digestible tidbits about Unicode. (Thanks, Jonas!)

said on 12 Jun 2005 at 20:35

Why, hello Why! Thanks for the linkage.

Chad Fowler was kind enough to point out the cluelessosity of my test_reverse ... I could plead cut & paste idiocy, but instead I’ll just fix it:

  def test_reverse
    # there are ways around this, but...
    s = "Καλημέρα κόσμε!"
    reversed = "!εμσόκ αρέμηλαΚ"
    srev = s.reverse
    assert_equal(reversed,srev) # fails
  end

Oh dear… I fear I’ve been escaped. Well anyway, it’s fixed in my post now.

PS. What ever happened to the timid foxfaced girl? I have lost sleep, I tell you.

said on 14 Jun 2005 at 20:23

Why: Your poignant guide is awesome! I accidentally learned Ruby while reading it, however…

Anyway, in Chap. 5, you conflate the class names WishScanner and MindScanner.

Also, I defeated Dwemthy’s array with 1 rabbit by doing

10000.times { r % r } ; r  

initially, and whenever I needed a health boost :-D. Yes, you can eat lettuce and poop on yourself for fun and profit! :-D

said on 29 Jun 2005 at 15:35

How about downcase/upcase?

said on 18 Dec 2005 at 18:05

If you are still curious, I am slowly hacking my way through this here:

http://julik.textdriven.com/svn/tools/rails_plugins/unicode_hacks/lib/unicode_hacks.rb

Primarily because I don’t care when that stuff is going to be in the Ruby core. I am developing UTF apps now and I need it to work now. And subclassing and flagging is all broken, because then every programmer on earth who doesn’t speak some non-Latin language will just skim over it and use the Usual String Of Bytes, and instead of producing bad text yourself you will be delegating it to others.

You need a gem for this to work, because otherwise I would end up storing the whole Han table in pure Ruby. That’s a lot of ’em characters.

