RedHanded

Mucking With Unicode for 1.8

by why in inspect

The idea with this little project is to enhance strings in Ruby 1.8 to support encodings, following Matz’s plan, without breaking extensions, and while still allowing raw strings.

For now, you have to specify the encoding when you create the string:

 >> str = utf8("色は匂へど 散りぬるを")
 >> str[1,2]
 => は匂
 >> str[/散(.{2})/u, 1]
 => りぬ

I can’t use wchar, since I’m adding onto the RString class, which stores the raw bytes in RSTRING(str)->ptr. And I’ve got to hook into Ruby’s regexps; I can’t ignore that. So, instead, I’ve added an indexed array of character lengths. I’m not suggesting this is the answer, but consider that we have so little out there. When a string is first stored, it gets validated against the rules for its encoding, and each character’s size is recorded.

 >> require 'wordy'
 >> utf8("ვეპხის ტყაოსანი").index_s
 => [3, 3, 3, 3, 3, 3, 1, 3, 3, 3, 3, 3, 3, 3, 3]

The index_s method gives a list of the byte sizes for each character. I only support UTF-8 presently.
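To make the mapping concrete (a plain-Ruby sketch of the idea, not the C internals): a character index converts to a byte offset by summing the sizes of the characters before it.

 # Byte offset of character `char_index`, given per-character byte sizes.
 def byte_offset(sizes, char_index)
   sizes[0, char_index].inject(0) { |sum, n| sum + n }
 end

 sizes = [3, 3, 3, 3, 3, 3, 1, 3, 3, 3, 3, 3, 3, 3, 3]
 byte_offset(sizes, 7)  # => 19, where the eighth character's bytes begin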

The speed is pretty good. Creating new strings, concatenating strings, and dup’ing strings are generally just as fast as builtin strings. Substrings and slicing don’t compare, though. But not much additional memory is used: one 4-byte index is kept for every 16 characters, so it’s about 20-25% overhead on top of the raw string.

The repository is here. I could use some help finding a replacement for bitcopy, which is like a memcpy with bit offsets. The one I’m using is fast but buggy.
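One plausible reading of the numbers above (an assumption on my part; the post doesn’t spell out the layout): a UTF-8 character is 1 to 4 bytes, so each length minus one fits in 2 bits, and 16 of them pack into a single 4-byte word, which is exactly the kind of structure that needs a memcpy with bit offsets.

 # Assumed layout, for illustration only: (length - 1) per character,
 # 2 bits each, 16 characters per 32-bit word.
 def pack_lengths(sizes)
   word = 0
   sizes.each_with_index { |len, i| word |= (len - 1) << (i * 2) }
   word
 end

 def unpack_length(word, i)
   ((word >> (i * 2)) & 0b11) + 1
 end

 word = pack_lengths([3, 3, 1, 3])
 unpack_length(word, 2)  # => 1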

Wait, uh: I’m going to hold off and watch what Nikolai is doing here.

said on 18 Jul 2006 at 04:22

There’s already an initiative to get some Unicode support. It seems like they’re trying to achieve the same goal:

http://julik.textdriven.com/svn/tools/rails_plugins/unicode_hacks/

It will be put in Rails core for 1.2:

http://www.ruby-forum.com/topic/72893#new

said on 18 Jul 2006 at 05:05

class String
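  # Unary plus: +some_string wraps the String in a Wordy.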
  def +@() Wordy.new(self) end
end

(+"Smörgåsbord").class # => Wordy

said on 18 Jul 2006 at 06:05

+1, Qerub. Make this hack as simple to use as possible.

said on 18 Jul 2006 at 06:50

So basically you made the 15th Unicode string implementation for Ruby. Without normalisation.

Can’t call it time well spent, although the effort is noble.

Check out ICU4R for something four times as complete (and about 50 times as large).

said on 18 Jul 2006 at 09:33

interesting!

said on 18 Jul 2006 at 09:37

Thijs: This one’s in C.

Qerub: Cool!

Julik: This one’s in C.

said on 18 Jul 2006 at 09:55

@why: just a thought. I don’t know much about Unicode implementations.

If you store the index of each char, rather than its length, so that index_s in your above example would be:

 [0, 3, 6, 9, 12, 15, 18, 19, 22, 25, 28, 31, 34, 37, 40]

you can
  • calculate length simply by difference
  • access a char anywhere in the string in O(1)
  • calculate the byte length of any slice in O(1)
you must
  • recalculate all indexes when something is inserted
  • do more work for concatenation

But it seems more natural to me: the old linked-list-versus-pointer-array problem.
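To sketch the cumulative-offset idea in plain Ruby (my illustration; the trailing 43 is an extra entry for the total byte length, so the last character’s size can also be taken by difference):

 offsets = [0, 3, 6, 9, 12, 15, 18, 19, 22, 25, 28, 31, 34, 37, 40, 43]

 # Slice `count` characters starting at character `from`: both the starting
 # byte and the byte length are O(1) lookups. (Ruby 1.8's String#[] works
 # on bytes, which is what we want here.)
 def char_slice(bytes, offsets, from, count)
   start = offsets[from]
   bytes[start, offsets[from + count] - start]
 end

 char_slice("ვეპხის ტყაოსანი", offsets, 7, 2)  # => "ტყ"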

said on 18 Jul 2006 at 10:14

If you do that, though, then you have to store an unsigned long for every character. So you might as well just store a pointer to that character. I went that route at first, just while playing around, but you don’t really lose that much speed with the offsets.

On big documents, sure, but wordy_charpos will also compute the offset from the end of the list (or from a supplied pointer from a previous search), whichever is most efficient.

said on 18 Jul 2006 at 10:19

“the old linked-list-versus-pointer-array problem”

which is usually solved by using some form of tree structure with O(lg n) access times instead.

Either way, storing indexes is hardly the way to deal with UTF-8. Then you might as well use ICU4R, as turning the string into UTF-16 should be about as time-consuming in the first run as this index-calculating stuff.
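For illustration (my own sketch, not code from any of the libraries discussed here), a Fenwick tree over the per-character byte lengths gives O(lg n) character-to-byte-offset lookups and O(lg n) updates when characters change size:

 # Fenwick (binary indexed) tree over per-character byte lengths.
 class CharIndex
   def initialize(sizes)
     @tree = Array.new(sizes.size + 1, 0)
     sizes.each_with_index { |len, i| add(i, len) }
   end

   # Add delta to the byte length of character i.
   def add(i, delta)
     i += 1
     while i < @tree.size
       @tree[i] += delta
       i += i & -i
     end
   end

   # Byte offset of character i: the sum of lengths of characters 0...i.
   def byte_offset(i)
     sum = 0
     while i > 0
       sum += @tree[i]
       i -= i & -i
     end
     sum
   end
 end

 idx = CharIndex.new([3, 3, 3, 1, 2])
 idx.byte_offset(3)  # => 9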

Anyway, I have a library that’s still under heavy development, but it does do most of the commonly used stuff.

You can check it out at my git repository.

What it does is mix methods into the String class, using the u”...” notation _why previously demonstrated. I guess using +”...” is better, though, as it will work in all cases.

Anyway, it doesn’t do anything fancy. It just treats the contents of the String (or RString) as a sequence of bytes. The only thing I might like to add to RString would be a char_len field (or something similarly named) that keeps track of the number of “characters” in the string, not the number of bytes. This would help with index checking.
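A minimal sketch of that mixin approach (module and method names are mine, not the library’s): the character count is just the number of bytes that are not UTF-8 continuation bytes.

 module CharCount
   # Continuation bytes look like 0b10xxxxxx; every other byte starts
   # a character, so counting the others gives the character length.
   def char_len
     count = 0
     each_byte { |b| count += 1 if (b & 0xC0) != 0x80 }
     count
   end
 end

 "Smörgåsbord".extend(CharCount).char_len  # => 11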

What’s important, however, is that it works.

To give credit where credit is due: The actual UTF-8 handling is heavily based on/borrowed/stolen from the glib library.

The library is currently based on Unicode 4.1, but as Unicode 5.0 came out of beta today (!), there will hopefully be support for 5.0 soon.

said on 18 Jul 2006 at 10:36

For anyone who wants to try nweibull’s lib:

 cg-clone http://git.bitwi.se/unicode.git/

Still, you’re using the same general idea with the decomposition table. Hey, this is great stuff! I was really hoping someone would pop up with something better. Is there something special I need to do to get this to compile?

said on 18 Jul 2006 at 10:53

The decomposition stuff is only used for normalization – or am I missing something?

To compile:

$ ruby extconf.rb
$ make

Hopefully that’ll work. You’ll probably need gcc, or at least a compiler that understands C99. If someone wants to backport it to C89, that’s fine, but I really don’t think C is fun enough to limit myself to the dated rules of C89.

Also, this is taken from an earlier project where the code was used for a text editor I was writing, and since I’ve been in programming mode, not packaging & maintenance mode, the previous week, the library will install as ‘ned/unicode’. I’m thinking that simply calling the library ‘utf8’ or ‘now/utf8’ (I’m trying to put all my libraries in the ‘now’ “namespace”) will do fine.

Running rake will run the rspec-based tests (which are far too few in number).

Anyway, my long-term goal is to make the string processing as generic as possible so that other encodings can be supported (like Oniguruma does today). That way, most of the code that will be needed for Rite will already exist by the time that stuff is going to be implemented.

However, more hints on exactly how Matz wants Strings to work in Ruby 2.0 will be necessary.

said on 18 Jul 2006 at 10:57

Ok, so now we have about 4-5 implementations in C/Ruby for doing the same thing. Cool.

said on 18 Jul 2006 at 11:00

Just to count – the Unicode gem, utf8_proc, the two implementations mentioned in this entry, and ICU4R.

I’m all for diversity but maybe you guys can join forces? ICU4R is in C, _why. And it will most likely be many times faster than other stuff people come up with in a hackish way.

said on 18 Jul 2006 at 11:27

nweibull: Okay, I’m getting an error on FreeBSD, some conflict with the index method definition in /usr/include/strings.h. I’ll play with it.

said on 18 Jul 2006 at 11:33

Looking forward
to the day
we will have
as many implementations
as codepoints.

said on 18 Jul 2006 at 12:02

Julik: Here’s a breakdown of the libraries that exist so far:

Yoshida Masato’s Unicode library only does normalization and case conversion.

UTF8Proc seems to do even less, by only doing normalization. I may be looking at an old (0.2) version, though.

The unicode_hacks plugin for RoR does quite a lot, but in a very inefficient way, as it is written in Ruby (a lot of allocation work is done, which can be avoided in C).

ICU4R uses ICU, which is a fantastic Unicode library (probably the most complete there is). The problem is, ICU uses UTF-16 internally, and this is definitely not always what you want.

I can’t say much about _why’s library, but it looks like it can do some nice stuff, even though it’s immature.

About my own library: It layers “flawlessly” over Ruby’s own String class, so there’s no conversion going on, no extra allocation necessary, and no speed decrease (or increase, for that matter) for operations that don’t deal with UTF-8, and it does all the stuff you would like to do. It isn’t a hack. It is, however, a work in progress, and some methods are still not implemented. Also, error checking is missing in many places, so feeding it illegal UTF-8 sequences may blow up your Ruby session.

why: Ouch, typical – I use index as a variable name in quite a few places, and it seems like your compiler has issues. Using index as the name of a local variable should be acceptable, but I’ll try to come up with better names for my variables.

asdf: Seeing as how the number of possible code points is limited in Unicode, that day may yet come.

said on 18 Jul 2006 at 13:57

nweibull: Okay, after renaming index and strnstr, I’ve got it compiled. The specs don’t run for me, since there’s no Kernel#u.

It looks like Kernel#u should be:

 def u(str); str.extend UTF end

I really like what you’ve got here. I want to play with it.

said on 18 Jul 2006 at 14:28

I don’t suppose anyone has an implementation of stringprep/nameprep in Ruby floating around?

said on 18 Jul 2006 at 16:36

why: That’s weird. Are you sure unicode.rb is getting loaded? It, in turn, requires unicode.so and sets up the bindings. Kernel#u should be def u(str); str.extend(UTFMethods); end and it is defined in unicode.rb. I wonder why this isn’t working for you…

Perhaps running the specifications without having the library installed works? Although, the $LOAD_PATH.unshift '..' should make sure that everything works correctly inside the specifications.

Anyway, I’m allocating tomorrow to clean up the file structure and make sure that compilation, testing, and installation work better than they do now.

said on 18 Jul 2006 at 18:18

Julik: _why is writing it, so it’ll be 10x greater than all the other libraries!!! :D

said on 18 Jul 2006 at 18:49

“If you do that, though, then you have to store an unsigned long for every character.”

Oh yes, that’s true. I didn’t think about that.

I’m glad you are addressing two of the most common complaints about Ruby: bad Unicode support and lack of speed.

said on 19 Jul 2006 at 04:11

I’ve been reading Lean Software Development: An Agile Toolkit, which advocates deferring decisions where practical, because things change. It also advocates having multiple implementations of something, because when the groups get together you can breed something wonderful from the diverse gene pool. This does require the groups to talk, though.

said on 21 Jul 2006 at 08:50

At the moment we want to be a) interoperable with others, b) free of nasty subclassing (what you have), and c) usable without compiling stuff.

I think we nailed the compromises pretty well. When your extension is ready we will try to implement a handler for it.

As for the inefficiency – well, do better without C. Manfred will be eager to hear your suggestions. And do better when some of the strings you process might not be Unicode strings, either.

said on 21 Jul 2006 at 08:55

In all fairness – if you want to flex your C muscle just pick up ICU4R and finish it.

said on 21 Jul 2006 at 09:07

Besides,

 git clone http://git.bitwi.se/unicode.git/
 Cannot get remote repository information. Perhaps git-update-server-info needs to be run there?

 git clone git://git.bitwi.se/unicode.git/
 fatal: unable to connect a socket (Connection refused)
 fetch-pack from 'git://git.bitwi.se/unicode.git/' failed.

said on 23 Jul 2006 at 11:06

Julik: If you’re addressing me above: a) remains true, b) isn’t true, c) eh?

There’s no subclassing going on here. All that happens is that the string is extended with a module.

Again, I am not here to replace someone else’s solution or claim that my solution is the last word in character encoding handling.

I’ve already stated why ICU4R isn’t a solution for me. I don’t want my strings decoded, encoded, and re-encoded. I want them to remain a sequence of bytes.

Strange that cg-clone worked. I forgot to chmod post-update.

Anyway, there’s a new repository, which you can clone:

 git clone http://git.bitwi.se/ruby-character-encodings.git

said on 24 Jul 2006 at 13:56

Nikolai, what I meant by “subclassing” is _why’s method of doing u”bla” (who’s going to do u”bla” on the CGI params and the database and the sockets…); I was somewhat confused. Especially considering I couldn’t look at your extension because of these bizarro git things I’m barely familiar with.

The fact that ICU4R uses separate strings and regexps is a hindrance, but ICU implements a lot of very, very fancy Unicode mechanics (locale-aware sentence boundaries? word boundaries? locale-independent grapheme clusters?). The problem is that making your own will oblige you to implement all (or some) of these by yourself. OTOH, ICU4R is abandoned now, and you can easily rig it into Ruby’s String (which will make it much more usable).

We did the mixing in of methods somewhat differently (making a wrapper around Strings with the same API, but preserving the original string). We also do native regexps (with character offsets) and gsub – basically all that String does.
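A rough sketch of such a wrapper (names illustrative, not the plugin’s actual API): the original String stays untouched, character-aware calls are answered by the proxy, and everything else falls through.

 class CharProxy
   def initialize(str)
     @str = str
   end

   # Character count via Ruby 1.8's UTF-8 unpack directive.
   def length
     @str.unpack('U*').size
   end

   # Preserve the original String: delegate everything else to it.
   def method_missing(name, *args, &block)
     @str.send(name, *args, &block)
   end
 end

 CharProxy.new("Smörgåsbord").length  # => 11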

All I hope for is we can make a handler for our plugin with your extension when it’s ready. And when it builds and runs on OS X without diffs, for that matter :-)

said on 24 Jul 2006 at 17:01

Julik: Dealing with grapheme clusters and various boundaries is of course awesome, so hopefully I or someone else (hint, hint) will write code for that as well.

I don’t own a Mac, and I don’t run FreeBSD or Windows, so someone will have to help me with the OS porting.

Anyway, all I hope now is that I didn’t announce this library too soon.

The current task is to write up specifications so that we can ensure backwards compatibility while supporting Unicode as well.
