RedHanded

Closing in on Unicode with Jcode

by why in inspect

Patrick Hall has a great article on using the Jcode module for Ruby, which provides more natural support for hacking on Unicode strings. He has a few simple unit tests that illustrate failings in the Jcode library and leaves them right there for us to glare at.

 def test_reverse
   s = "Καλημέρα κόσμε!" 
   srev = s.reverse
   assert_equal(s,srev) # fails
 end

 def test_index
   # String#index isn't Unicode-aware, it's counting bytes
   # there are ways around this, but...
   s = "Καλημέρα κόσμε!" 
   assert_equal(0, s.index('Κ')) # passes
   assert_equal(1, s.index('α')) # fails!
   assert_equal(3, s.index('α')) # passes; 3rd byte!
 end
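
To see the byte-versus-character gap from test_index without leaning on String#index (whose counting depends on the Ruby version and encoding settings), here is a minimal sketch that counts in code points via unpack('U*'). The char_index helper is a hypothetical name for illustration; it is not something Jcode or Patrick's article provides.

```ruby
# -*- coding: utf-8 -*-

# Hypothetical helper (not part of Jcode): position of the first occurrence
# of a single character, counted in code points instead of bytes.
def char_index(str, char)
  target = char.unpack('U*').first   # code point of the character we want
  str.unpack('U*').index(target)     # nil if the character is absent
end

s = "Καλημέρα κόσμε!"
puts char_index(s, 'Κ')   # position of the capital kappa
puts char_index(s, 'α')   # position of the first alpha
```

This only looks at the first code point of the search argument, so it will not help with multi-character substrings or combining sequences.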

Sure, we’ll have all the answers in the future, but, for now, I’d say some patches to Jcode are in order. Or, to spirit up some Python mimicry:

 class UString < String
   # Show u-prefix as in Python
   def inspect; "u#{ super }" end

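   # Count multibyte characters (/m so newlines count too)
   def length; self.scan(/./mu).length end

   # Reverse the string, keeping the UString class so inspect still shows the u-prefix
   def reverse; self.class.new(self.scan(/./mu).reverse.join) end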
 end

 module Kernel
   def u( str )
     UString.new(str.gsub(/U\+([0-9a-fA-F]{4})/u) { [$1.hex].pack('U*') })
   end
 end 

 str = u"Ruby-語" 
 str.length   #=> 6
 str.reverse  #=> u"語-ybuR" 
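
The U+XXXX expansion inside u is easy to miss, since the example never exercises it. Here is a small standalone sketch of just that step, assuming a hypothetical expand_codepoints helper (a made-up name, not in the code above) and the same four-hex-digit pattern:

```ruby
# -*- coding: utf-8 -*-

# Hypothetical helper: replace each U+XXXX escape (exactly four hex digits,
# as in the post's regex) with the UTF-8 character for that code point.
def expand_codepoints(str)
  str.gsub(/U\+([0-9a-fA-F]{4})/) { [$1.hex].pack('U*') }
end

expanded = expand_codepoints("Ruby-U+8A9E")
puts expanded                       # the escape becomes a single character
puts expanded.unpack('U*').length   # counts code points, not bytes
```

Code points above U+FFFF need more than four hex digits, so this pattern would not cover them.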

Anyway, Patrick’s blog is a great tour through easily digestible tidbits about Unicode. (Thanks, Jonas!)

said on 12 Jun 2005 at 20:35

Why, hello Why! Thanks for the linkage.

Chad Fowler was kind enough to point out the cluelessosity of my test_reverse ... I could plead cut & paste idiocy, but instead I’ll just fix it:

  def test_reverse
    # there are ways around this, but...
    s = "Καλημέρα κόσμε!"
    reversed = "!εμσόκ αρέμηλαΚ"
    srev = s.reverse
    assert_equal(reversed,srev) # fails
  end

Oh dear… I fear I’ve been escaped. Well anyway, it’s fixed in my post now.

PS. What ever happened to the timid foxfaced girl? I have lost sleep, I tell you.

said on 14 Jun 2005 at 20:23

Why: Your poignant guide is awesome! I accidentally learned ruby while reading it, however…

Anyway, in Chap. 5, you conflate the class names WishScanner and MindScanner.

Also, I defeated Dwemthy’s array with 1 rabbit by doing


10000.times { r % r } ; r  

initially, and whenever I needed a health boost :-D. Yes, you can eat lettuce and poop on yourself for fun and profit! :-D

said on 29 Jun 2005 at 15:35

How about downcase/upcase?

said on 18 Dec 2005 at 18:05

If you are still curious, I am slowly hacking my way through this here:

http://julik.textdriven.com/svn/tools/rails_plugins/unicode_hacks/lib/unicode_hacks.rb

Primarily because I don’t care when that stuff is going to land in the Ruby core. I am developing UTF apps now and I need it to work now. And subclassing and flagging are both broken, because every programmer on earth who doesn’t speak some non-Latin language will just skip over them and use the Usual String Of Bytes, so instead of producing bad text yourself you end up delegating that to others.

You need a gem for this to work because otherwise I would end up storing the whole Han table in pure Ruby. That’s a lot of ’em characters.
