RedHanded » Nikolai's UTF-8 Lib is All Ready

Nikolai's UTF-8 Lib is All Ready

Last week, in the comments, Nikolai Weibull brought up his UTF-8 lib, a lovely creature which meets my own needs much better than what’s already out there. I like it much better than my own efforts. Especially now that he’s had some time to flesh it out.

Namely: It’s small. It’s coded in C. It locks into Ruby’s existing string class. Therefore, it can be efficient with memory and use Ruby’s own regexps.

 require 'encoding/character/utf-8'
 str = u"hëllö" 
 str.length
   #=> 5
 str.reverse.length
   #=> 5
 str[/ël/]
   #=> "ël"
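
For a rough feel of what a codepoint-aware length has to do on top of a plain byte string, here is a minimal sketch in pure Ruby. It is not Nikolai's C code; the helper name utf8_codepoint_count is made up for illustration, and it assumes the input is well-formed UTF-8. The trick is to count every byte that is not a continuation byte (bit pattern 10xxxxxx).

```ruby
# Hypothetical helper, not part of the character-encodings library.
# Counts UTF-8 codepoints by skipping continuation bytes (10xxxxxx).
# Assumes the input is well-formed UTF-8.
def utf8_codepoint_count(str)
  str.each_byte.count { |byte| (byte & 0xC0) != 0x80 }
end

# "hëllö" spelled out as raw bytes; String#b gives a binary copy.
raw = "h\xC3\xABll\xC3\xB6".b

puts raw.bytesize               # number of bytes
puts utf8_codepoint_count(raw)  # number of codepoints
```

The byte string underneath never changes; only the way it is counted does.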

If you’d like to follow development, clone this (git-web.) I’ve also put up a gem: gem install character-encodings --source code.whytheluckystiff.net, but obviously it’s not an official release or anything.

24 Jul 2006 at 10:22 | 32 comments

FlashHater

said on 24 Jul 2006 at 11:29

Yey!! Oh wait, I’m an American, and thus have no use for international standards… >.<

Daniel Berger

said on 24 Jul 2006 at 12:08

I guess Tim Bray will have to update his slides for RubyConf 2006 now. :)

oliver

said on 24 Jul 2006 at 12:31

When installing the gem I get the following error on OS X. Any ideas?

Besides that, this seems to be what I have been looking for for a long time. :)


Attempting local installation of 'character-encodings'
Local gem file not found: character-encodings*.gem
Attempting remote installation of 'character-encodings'
Updating Gem source index for: http://code.whytheluckystiff.net
Building native extensions.  This could take a while...
cc1: warnings being treated as errors
properties.c: In function ‘remove_all_combining_dot_above’:
properties.c:571: warning: control reaches end of non-void function
make: *** [properties.o] Error 1
cc1: warnings being treated as errors
properties.c: In function ‘remove_all_combining_dot_above’:
properties.c:571: warning: control reaches end of non-void function
make: *** [properties.o] Error 1
ruby extconf.rb install character-encodings --source code.whytheluckystiff.net

13

said on 24 Jul 2006 at 12:35

oliver: Same here (using Ubuntu) :(

psmith

said on 24 Jul 2006 at 12:40

oliver: I had to apply this diff to get it to compile:


--- ext/encoding/character/utf-8/extconf.rb.orig        2006-07-24 13:24:39.000000000 -0500
+++ ext/encoding/character/utf-8/extconf.rb     2006-07-24 13:25:01.000000000 -0500
@@ -22,7 +22,6 @@
 try_compiler_option('-Wundef')
 try_compiler_option('-Wpointer-arith')
 try_compiler_option('-Wcast-align')
-try_compiler_option('-Werror')
 # XXX: sadly, -Wshadow is a bit too strict.  It will, for example, whine about
 # local variables called “index” on FreeBSD.
 # try_compiler_option('-Wshadow')

Then I had to manually copy the utf8.bundle to the lib/encoding/character directory and apply this diff:


--- lib/encoding/character/utf-8.rb.orig        2006-07-24 13:29:23.000000000 -0500
+++ lib/encoding/character/utf-8.rb     2006-07-24 13:34:25.000000000 -0500
@@ -2,7 +2,7 @@
 #
 # Copyright © 2006 Nikolai Weibull <now@bitwi.se>

-require 'encoding/character/utf-8/utf8.so'
+require 'encoding/character/utf8.bundle'

 # TODO: Rework this to use a dispatch object instead, so that the encoding can
 # be changed on the fly.

FlashHaterHater

said on 24 Jul 2006 at 12:53

Damn straight. Why do you think so many people in other countries speak English?

why

said on 24 Jul 2006 at 12:54

Okay, the gem is updated to turn off -Werror for now.

psmith

said on 24 Jul 2006 at 12:59


irb(main):001:0> require 'encoding/character/utf-8'
=> true
irb(main):002:0> str = u"hëllö" 
=> u"h\303\253ll\303\266" 
irb(main):003:0> str.size
=> 7
irb(main):004:0> RUBY_VERSION
=> "1.8.4" 
irb(main):005:0> RUBY_PLATFORM
=> "i686-darwin8.6.2"

why not

said on 24 Jul 2006 at 13:49

Why not just return what the function should??

At the end of remove_all_combining_dot_above(), either: return decomp_len; or: return 0;

...as appropriate (I believe that with -Werror off, it’ll return 0). Probably the latter, but I’m having trouble parsing the function, so I can’t be sure.

nil

said on 24 Jul 2006 at 14:50

but everyone will speak ruby as the official language of the rubiverse! :)

sporkmonger

said on 24 Jul 2006 at 15:11

How is it at dealing with unicode normalization?

Manfred

said on 24 Jul 2006 at 16:49

Length returns the number of codepoints, not what we would think of as ‘characters’. Using NFC or NFKC doesn’t solve this either because there are a lot of characters which can’t be composed.

The solution for this is ‘grapheme clusters’, described in the unicode standard annex 29. The suggested implementation in the annex covers most of the characters used in everyday life.

It looks like Nicolai didn’t implement this.
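
To make the codepoint versus grapheme cluster distinction concrete, here is a small sketch using methods that later Ruby versions ship with (String#unicode_normalize since 2.2, String#grapheme_clusters since 2.5). None of this is part of Nikolai's library; it only illustrates the point about characters that cannot be composed.

```ruby
# Built-in methods from later Ruby versions, not the character-encodings API.

# "e" followed by COMBINING ACUTE ACCENT: two codepoints that NFC can compose.
e_acute = "e\u0301"
puts e_acute.length                          # codepoints before normalization
puts e_acute.unicode_normalize(:nfc).length  # codepoints after NFC

# HEBREW LETTER YOD WITH HIRIQ is excluded from composition,
# so NFC leaves it as a base letter plus a combining point.
yod = "\uFB1D".unicode_normalize(:nfc)
puts yod.length                    # still more than one codepoint
puts yod.grapheme_clusters.length  # but a single grapheme cluster
```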

nweibull

said on 24 Jul 2006 at 17:46

OK, to everyone on OS X, I’ve now updated the require to read require 'encoding/character/utf-8/utf8', so Ruby should be able to figure out the extension for itself.

Second, about the whole -Werror and -W stuff, sorry. For some reason, my compiler (gcc 3.4.6) didn’t report the “non-void without return” error. I’ll have to look into that. It’s good that I didn’t disable -Werror, however, as it was a horrible bug. The code has been fixed now. (Update, it’s the -std=c99 that does it. I have no explanation for this behavior – it seems that the c99 code generator will fill in the return instruction anyway – I’ll upgrade to 4.1 soon.)

About #size: it’s not overridden, so it’ll return the number of bytes in the string. I don’t know if that makes sense at all, but that’s how it currently works. Perhaps an addition of #byte_length makes more sense.

sporkmonger: Good, although there’s no Ruby interface for it yet. I’ll add a method for normalization tomorrow.

Manfred: Ni/k/olai, if you please. And you are definitely right about “grapheme clusters”. It’s certainly something that is worth supporting. However, normalization does cover a lot of the everyday cases, so it’s not like we can’t do without “grapheme clusters” either.

I guess next on the list is to create a rubyforge project so that we can have a mailing list instead of discussing everything here.

oliver

said on 25 Jul 2006 at 00:48

Nikolai: Thanks, for doing the OS X bugfix.

I am looking forward to the rubyforge project.

Manfred

said on 25 Jul 2006 at 01:49

Nikolai: Oops, sorry I misspelled your name. I would love to see a mailing list to discuss some things (:

Manfred

said on 25 Jul 2006 at 03:52

Sorry, but I have to comment some more.

Normalizing a string is not enough to “cover a lot of the everyday cases”: there are a lot of characters which can’t be composed. Nikolai’s utf-8 library doesn’t expose normalization yet, so I’m using an alternative library for this example:

c = [0xFB1D].pack('U') # HEBREW LETTER YOD WITH HIRIQ
c.chars.length #=> 1
c = c.chars.normalize_D
c.unpack('U*').length #=> 2
c = c.chars.normalize_KC
c.unpack('U*').length #=> 2

Even though everybody would agree that this is one character, slicing between these codepoints will leave us with a different character than we started with. In German, for some words the only difference between the singular and plural forms is an umlaut; if we chop this accent off with a broken slice, we significantly change the meaning of the text.

The other problem with this solution is that it modifies the string methods: every Ruby program expects the length method to return the length in bytes. Consider this:

str = u"hëllö" 
headers << "Content-length: #{str.length}"

I’ve run into this in the past myself and believe me, it wreaks havoc in Webrick.
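
A minimal sketch of the safe version of that header, assuming a later Ruby (1.9 and up) where String#bytesize is always the byte count regardless of how #length is defined. The variable names are illustrative only.

```ruby
# String#bytesize is built into later Ruby versions (1.9 and up);
# it is not part of the character-encodings library discussed here.
body = "h\u00EBll\u00F6"   # "hëllö" as a UTF-8 string

headers = []
# Content-Length must be the number of bytes on the wire, never codepoints.
headers << "Content-Length: #{body.bytesize}"

puts headers.first
puts body.length   # codepoint count, the wrong number for this header
```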

nweibull

said on 25 Jul 2006 at 04:52

Manfred: There’s always #size, which remains unchanged. But please do come with a suggestion that makes both lengths easily accessible (and perhaps the third, the number of grapheme clusters…). I agree that overriding methods does have some negative consequences as well. Of course, the methods are overridden on a per-object basis, so in your example above, you, as a developer, should be aware that when you say that str is a UTF-8-encoded string, #length will not return the number of bytes in str, but rather the number of codepoints in str.

All: There’s now a Rubyforge project set up for the character-encodings library.

There’s a mailing list, called char-encodings-development, which isn’t active yet. Hopefully Tom will get it set up sooner rather than later :-).

will

said on 25 Jul 2006 at 07:31

nweibull, intuitively #size would return the length in bytes and #length should return the ‘character’ length for me.

Boris K

said on 25 Jul 2006 at 07:34

This is great work. Thank you Nikolai, thank you _why.

Object

said on 26 Jul 2006 at 12:39


def language_spoken(date) 
  date > Date.today ? 'ruby' : 'utf-8'
end

MenTaLguY

said on 26 Jul 2006 at 15:45

Nikolai: The problem really isn’t that you’ve got to convince ruby developers that String#length doesn’t mean the same thing as it normally does, it’s that you’ve potentially got to convince every piece of Ruby code ever written.

I can see a change like this between, say, Ruby 1.8 and 1.9. I’ve got a harder time justifying it for a library. Does 1.9 at least have a similar change, by the way?

nweibull

said on 26 Jul 2006 at 15:57

MenTaLguY: Well, it all depends. I again stress that this is on a per-object level, so it won’t change anything unless you explicitly tell it to, which means that changing the meaning of String#length isn’t all that drastic after all. However, often what you actually mean when you say String#length is String#width, or perhaps you actually want to take grapheme clusters into account, so it’s all rather fuzzy as it is.
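
As a rough illustration of what “per-object” can mean in plain Ruby, here is a sketch that overrides #length on a single string with Object#extend. The module name CodepointLength is hypothetical, and this is only an approximation of the idea, not how the library does it in C.

```ruby
# Hypothetical module, for illustration only.
module CodepointLength
  # Decode the bytes as UTF-8 and count the resulting codepoints.
  def length
    unpack('U*').length
  end
end

plain  = "h\xC3\xABll\xC3\xB6".b                          # ordinary byte string
marked = "h\xC3\xABll\xC3\xB6".b.extend(CodepointLength)  # opted in explicitly

puts plain.length    # bytes, untouched by the module
puts marked.length   # codepoints, only for this one object
puts marked.size     # #size is not overridden, so still bytes
```

Code that never asked for the override keeps seeing byte lengths; the catch raised above is the code that gets handed a marked string without knowing it.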

MonkeeSage

said on 27 Jul 2006 at 07:48

It seems to me that the Right Thing to Do is to mixin methods to the String class globally rather than on a per-object basis, adding a “u” prefix (“ulength”, “ureverse”, &c).

Pros:

A clear distinction between byte-oriented and character-oriented methods.
No need to initialize strings with a special method (e.g., “u()” or “+”)—all strings are potentially UTF strings, the only time that comes into play is when the “u”-methods are called.
One can always use the “u”-methods in a mixed-data context, just to be safe, as they work for UTF only ordinals and the ASCII subset, but still have access to the old byte methods.
Existing programs won’t break.

Cons:

The bracket-accessor method “[]” – should some kind of “at” method be used instead (“uat()”), in order to preserve the byte-based indexing using this method but still allow character-based indexing? Other similar questions.
One has to remember to use the “u”-method versions (I’m sure that would bite me from time to time).

There may be other pros/cons, but those are the ones I could think of.

nweibull

said on 27 Jul 2006 at 13:12

MonkeeSage: I don’t think that’s an appropriate solution, as I want to keep it as true to the idea of Strings in Rite as possible, i.e., every string has an encoding. The encoding can be accessed through #encoding, and I do think that it would be worthwhile to be able to change the encoding on the fly for any given string. That way you can easily switch between treating a string as a sequence of bytes, and something more advanced such as UTF-8, UTF-16, or some such. That way I think one solves the problem without changing the interface per se.

Remember, the only reason #length returns the length of the string in bytes is because that’s the way it is currently implemented. Having to live with such a restriction for all eternity seems rather silly.
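
For reference, later Ruby versions (1.9 and up) ended up with something close to this: every string carries an encoding, and String#force_encoding relabels the same bytes on the fly. The sketch below uses only those built-ins and says nothing about how the character-encodings library itself exposes #encoding.

```ruby
# Built-ins from later Ruby versions, not the character-encodings API.
text = "h\u00EBll\u00F6"    # "hëllö", tagged as UTF-8
puts text.encoding
puts text.length            # counted in codepoints

# Relabel a copy as raw bytes; the bytes themselves are unchanged.
as_bytes = text.dup.force_encoding(Encoding::BINARY)
puts as_bytes.length        # counted in bytes

# And back again, without any conversion taking place.
back = as_bytes.dup.force_encoding(Encoding::UTF_8)
puts back == text
```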

MonkeeSage

said on 28 Jul 2006 at 09:00

Nikolai: I think aiming for Rite compatibility is probably a good idea. Personally though, I’m still not sure I like the idea of an encoding associated with every string by default, even if it is adopted by Matz for Ruby2 (not much I can do about that though!).

IMO, all strings should be thought of simply as groups of bytes on the basic class level (and offer byte-level access). The where and what of those groupings and so forth should come in at the level of manipulating them, i.e., through methods or subclasses or modules. It would be very easy to have, e.g., EncodedString < String, which you have to initialize with a method that takes an encoding and has methods for manipulating and translating encoded strings and adds an encoding attr.

Anyhow, implementation gripes aside, I appreciate the work you’re doing, please keep it up!
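
A bare-bones sketch of what such an EncodedString subclass might look like, written against a later Ruby (1.9 and up) so it can lean on force_encoding. The attribute name declared_encoding and the method char_length are invented for illustration; nothing like this exists in the library.

```ruby
# Hypothetical class sketching the EncodedString < String idea.
class EncodedString < String
  attr_reader :declared_encoding

  def initialize(bytes, declared_encoding)
    super(bytes.b)                    # store a plain byte copy
    force_encoding(Encoding::BINARY)  # keep byte-level semantics by default
    @declared_encoding = declared_encoding
  end

  # Character-level view, computed only when asked for.
  def char_length
    b.force_encoding(declared_encoding).length
  end
end

es = EncodedString.new("h\u00EBll\u00F6", Encoding::UTF_8)
puts es.length       # bytes, like any other string at the basic level
puts es.char_length  # characters according to the declared encoding
```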

why

said on 28 Jul 2006 at 11:09

MonkeeSage: It seems like the distinction between your approach and Nikolai’s is really very minor. Nikolai is storing a byte string underneath it all.

 >> require 'encoding/character/utf-8'
 >> gem = u(File.read("sources-0.0.1.gem"))
 >> gem.size
 => 3072
 >> gem.length
 => 2941
 >> gem.normalize
 => u"data.tar.gz" 

 >> require 'stringio'
 >> StringIO.new(gem).read
 => "data.tar.gz\000\000\000\000\000\000..."

But, yeah, Nikolai’s storming right into the class and overriding all kinds of methods. I don’t know if Matz wants length and size to be different, but it makes pretty good sense to me.

MonkeeSage

said on 28 Jul 2006 at 13:57

I suppose I could always do something like:


require('encoding/character/utf-8')
class String
  def method_missing(m, *args)
    if (m.to_s[0,1] == 'u')
      u(self).method(m.to_s[1..-1]).call(*args)
    else
      raise(NoMethodError, "No method `#{m}'", caller)
    end
  end
end
puts('hëllö'.ureverse)
puts('hëllö'.uupcase)

MonkeeSage

said on 28 Jul 2006 at 13:59

err… hëllö -> hëllö

MonkeeSage

said on 28 Jul 2006 at 14:01

one more time… hëllö -> hëllö

why

said on 28 Jul 2006 at 15:31

Yeah, see, that’s the spirit, MonkeeSage.

MonkeeSage

said on 28 Jul 2006 at 17:06

This is better…


Encoding::Character::UTF8.methods.each do |m|
  String.class_eval(%!
    define_method("u#{m}") do |*args|
      Encoding::Character::UTF8.#{m}(self, *args)
    end
  !)
end

MonkeeSage

said on 28 Jul 2006 at 22:48

Ok, so now I’m DRY, I’m meta-tacular, and GC friendly (all thanks to redhanded)...


require('encoding/character/utf-8')
def uStringMethods()
  umethods = []
  Encoding::Character::UTF8.methods.each do |m|
    umethods.push(%!
      define_method("u#{m}") do |*args|
        Encoding::Character::UTF8.#{m}(self, *args)
      end
    !)
  end
  String.class_eval(umethods.join)
end
uStringMethods

...now if I could just figure out what the heck an eigen is, I could make matrices and vectors and classes out of it…it would be like figuring out the secret recipe of the fluff in the fluffernutter…but that’s for another day.

Nikolai: I’m having trouble building off the latest head (are there still heads and branches in git terminology?). I’m going to join the rubyforge list and post details there.
