RedHanded

Futurism: Unicode In Ruby

by why in inspect

When asked about the future of Unicode in Ruby 1.9/2.0, Matz replied to Ruby-Core with the following laundry list of features he expects in Ruby’s multibyte character support:

  • characters are represented by single character strings.
  • so that "abc"[0] returns "a" instead of fixnum 97.
  • all string methods are aware of multibyte characters.
  • new method String#encoding gives character encoding name (e.g. "utf-8").
  • new method IO#encoding gives character encoding name for reading data.
  • new method IO#encoding= sets the character encoding for reading data.
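For readers on a later Ruby, here is a rough sketch of how the string half of that list looks in practice. It assumes Ruby 1.9 or newer, where String#encoding returns an Encoding object (its name is available through `name`) rather than a bare name string. The IO half ended up under different names (IO#external_encoding and IO#set_encoding), so it is left out here. The variable names are just for illustration.

```ruby
# Assumes Ruby 1.9+; the \u escapes force a UTF-8 string regardless of source encoding.
s = "r\u00E9sum\u00E9"

first     = "abc"[0]        # a one-character String, not the Fixnum 97
len_chars = s.length        # counts characters, not bytes
len_bytes = s.bytesize      # the byte count is still available separately
reversed  = s.reverse       # reverses by character, leaving multibyte sequences intact
enc_name  = s.encoding.name # name of the Encoding object attached to the string

puts first       # prints a
puts len_chars   # prints 6
puts len_bytes   # prints 8
puts enc_name    # prints UTF-8
```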

A library which emulates this could be built, based on Ruby’s current iconv lib. Anybody want to take a stab at it?
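To give a feel for what such an emulation layer might look like, here is a minimal sketch. Note the substitutions: it handles UTF-8 only and leans on `unpack("U*")` and `pack("U*")` instead of iconv, since iconv is no longer bundled with current Ruby. The class name `CharString` and its methods are hypothetical, not part of any existing library.

```ruby
# Hypothetical wrapper that gives a UTF-8 byte string character-level methods.
# Assumption: input is valid UTF-8; other encodings would need a conversion step first.
class CharString
  def initialize(bytes)
    # Decode the UTF-8 bytes into an array of code points.
    @codepoints = bytes.unpack("U*")
  end

  # Character at index i as a one-character string, or nil when out of range.
  def [](i)
    cp = @codepoints[i]
    cp && [cp].pack("U")
  end

  # Number of characters rather than bytes.
  def length
    @codepoints.length
  end

  # Reverse by character, so multibyte sequences stay intact.
  def reverse
    @codepoints.reverse.pack("U*")
  end

  # This sketch only ever deals in UTF-8.
  def encoding
    "utf-8"
  end
end

cs = CharString.new("na\u00EFve")
puts cs[2]      # the third character, a single-character string
puts cs.length  # prints 5
```

A real library would presumably convert other encodings to and from UTF-8 at the boundaries, which is where iconv would have come in.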

said on 07 Jan 2005 at 15:42

Do we have a time frame for the release of Ruby 2? Has Matz finished with the 1.8 release now, I wonder?

said on 07 Jan 2005 at 15:44

By the way, Why, how is chapter 6 of the poignant guide shaping up? Christmas has been and gone, you know…

said on 08 Jan 2005 at 13:11

What exactly can I say to placate you? I can’t have you in despair.

said on 08 Jan 2005 at 14:12

I would.

said on 09 Jan 2005 at 03:04

How about ‘chapter 6 is being uploaded now’? That’d do it.

said on 09 Jan 2005 at 22:48

What about different charsets? Are Ruby strings going to be stored as Unicode (I assume not)? If not, will Ruby have a pluggable charset handler or some such? Will there be a String#charset? What about String#language? (I vaguely remember that each string instance in Parrot will be tagged with charset, encoding, and language.)

said on 10 Jan 2005 at 14:09

Isn’t charset == encoding? Could you elaborate on ruby-talk, maybe?

said on 14 Jan 2005 at 06:05

Charset and encoding are two different concepts. Unicode is a charset (supposedly the be-all and end-all of charsets). Unicode can be encoded as UTF-8, UTF-16, etc.
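A quick way to see the distinction, assuming Ruby 1.9 or newer: take one character from the Unicode character set and look at the bytes it turns into under two encodings. The code point stays the same and only the byte sequence changes. The variable names are illustrative.

```ruby
# One code point (U+00E9), two encodings of it.
ch = "\u00E9"

utf8_bytes  = ch.encode("UTF-8").bytes     # two bytes in UTF-8
utf16_bytes = ch.encode("UTF-16BE").bytes  # two different bytes in big-endian UTF-16

# The code point is identical under both encodings.
cp_utf8  = ch.encode("UTF-8").codepoints
cp_utf16 = ch.encode("UTF-16BE").codepoints

p utf8_bytes   # prints [195, 169]
p utf16_bytes  # prints [0, 233]
p cp_utf8 == cp_utf16  # prints true
```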

said on 17 Jan 2006 at 03:12

An encoding implies a character set, so that isn’t really necessary.

said on 27 Jan 2006 at 11:07

Just a thought: how about a String#convert_char that takes a block with 3 parameters: the encoding of the original char, the desired target encoding, and the character itself?

The block converts the string one character at a time, each time returning the converted character in the target encoding. The results of the block are concatenated to form the target string. This could be used internally to support the UTF-8 encoding.

However, there’s still the issue of the byte order mark…
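The String#convert_char idea in the comment above could be sketched roughly like this. No such method exists in Ruby; `convert_chars` is a hypothetical stand-alone helper, and the block here simply delegates to String#encode (Ruby 2.3 or newer assumed, for `String.new(encoding:)`).

```ruby
# Hypothetical helper: yields (source encoding, target encoding, character)
# for each character and concatenates whatever the block returns.
def convert_chars(str, target)
  source = str.encoding
  target = Encoding.find(target)
  result = String.new(encoding: target)  # empty buffer already in the target encoding
  str.each_char do |char|
    result << yield(source, target, char)
  end
  result
end

# The block decides how a single character is converted.
utf16 = convert_chars("h\u00E9", "UTF-16BE") do |from, to, char|
  char.encode(to, from)
end

p utf16.bytes  # prints [0, 104, 0, 233]
```

Picking an encoding with an explicit byte order (UTF-16BE here) sidesteps the byte order mark in this sketch. A per-character block has no natural place to emit a mark once at the start of the output, so that would likely have to be handled outside the block.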
