(try two) On channels and servers and string encodings...

Jeremy Nelson

2007-08-25 04:25:13 UTC

The previous version (for some reason) was mangled into
quoted-unprintable so I am sending it again this time with smaller
lines so it won't be completely unreadable.

This is a discussion that is strictly academic at this time. Please
do not think that we are close to supporting any of this stuff.
Rather, I want to start a dialog about how people want this to work,
so when we do go ahead and design it, people will not think it came
out of left field.

So the question on the table is string encodings.

The input line: Right now, epic doesn't handle encoding on the
input line -- it just assumes that each byte is one code point. For
people using utf-8, one keypress yields one code point, which may be
encoded as multiple bytes, and those show up as multiple (incorrect)
characters in the input line rather than the key pressed. Column
counting is not broken /as such/.
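
To make the input-line problem concrete, here is a rough sketch (in
Python, not epic's C, and not how epic does anything today) of
buffering raw terminal bytes until a whole code point has arrived.
The function name keypresses_from_bytes is made up for illustration.

```python
import codecs

def keypresses_from_bytes(chunks, encoding="utf-8"):
    """Turn raw terminal bytes into whole code points (hypothetical helper)."""
    decoder = codecs.getincrementaldecoder(encoding)(errors="replace")
    keys = []
    for chunk in chunks:
        # decode() returns "" while a multi-byte sequence is still incomplete,
        # so nothing is added to the input line until the code point is whole.
        keys.extend(decoder.decode(chunk))
    return keys

# One keypress (o-umlaut) on a utf-8 terminal arrives as two bytes.
raw = [b"f", b"r", b"\xc3", b"\xb6", b"n", b"d"]
print(len(b"".join(raw)))          # number of bytes read
print(keypresses_from_bytes(raw))  # one list entry per code point
```

Six bytes come in but only five code points should land in the input
line, which is exactly the distinction the current code does not make.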

The display: Right now, epic doesn't handle encoding on the output
display. Any bytes received are just sent to the display, so if
you output a utf-8 string on a utf-8 emulator, it will show up
correctly, and if you output a utf-8 string on an iso-8859-* emulator,
it will yield multiple (incorrect) characters. Column counting is
(of course) broken.
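
A small sketch of what the two kinds of emulator make of the same
bytes. This only models the emulator by decoding the bytes with one
codec or the other. It ignores wide and combining characters, which
make real column counting harder still.

```python
text = "fr\u00f6nd"                    # 5 code points, with o-umlaut
wire = text.encode("utf-8")            # the bytes passed through untouched

# A utf-8 emulator reassembles the bytes into the original characters.
as_utf8 = wire.decode("utf-8")

# An iso-8859-1 emulator treats every byte as its own character.
as_latin1 = wire.decode("iso-8859-1")

print(len(wire), len(as_utf8), len(as_latin1))
print(repr(as_latin1))
```

A client that counts bytes thinks this string is 6 columns wide, which
is right on the iso-8859-1 emulator (where it looks wrong) and wrong
on the utf-8 emulator (where it looks right).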

The servers: Globally, the user can /set translation, which converts
the code points of one 8-bit character set (usually ascii) into those
of another 8-bit character set that the server is using. This is
fine, as long as both the user and the server are using 8-bit code
points (which is not the case for utf-8, obviously).
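
For anyone who hasn't looked at how a byte-for-byte translation works,
here is the general idea as a sketch. It is not epic's actual
/set translation code. build_translation and the "?" fallback are
assumptions of mine, and cp437 is just a convenient second 8-bit set.

```python
def build_translation(src, dst, fallback=b"?"):
    """Build a 256-entry table mapping each byte in `src` to a byte in `dst`."""
    table = bytearray(256)
    for value in range(256):
        try:
            out = bytes([value]).decode(src).encode(dst)
        except (UnicodeDecodeError, UnicodeEncodeError):
            out = fallback
        # Only a one-byte result fits in a one-byte slot.
        table[value] = out[0] if len(out) == 1 else fallback[0]
    return bytes(table)

table = build_translation("iso-8859-1", "cp437")
print(b"fr\xf6nd".translate(table))

# With utf-8 as the target, every non-ascii byte needs two output bytes,
# so the table can only hold the fallback for it.
broken = build_translation("iso-8859-1", "utf-8")
print(broken[0xF6] == ord("?"))
```

The second table shows why this approach stops working the moment one
side is utf-8: one byte in no longer means one byte out.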

Channel Names: Channel names can be encoded in any encoding. A
channel name like #frãnd could be encoded in iso-8859-1 and take
up 6 bytes, or the channel name could be encoded in utf-8 and take
up 7 bytes. The irc server will treat these as separate channels,
***so it's fundamentally important to be able to specify an encoding
when specifying a channel name.***
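
The byte counts are easy to check. This sketch uses the a-tilde
character (byte 0xE3 in iso-8859-1) from the channel name above.

```python
name = "#fr\u00e3nd"                  # 6 code points, with a-tilde

latin1 = name.encode("iso-8859-1")
utf8 = name.encode("utf-8")

print(len(latin1), len(utf8))
print(latin1, utf8)
print(latin1 == utf8)                 # the server compares bytes, not characters
```

Same characters on screen, two different byte strings on the wire, and
so two different channels as far as the server is concerned.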

Channel messages: People who chat on the channel may (or may not) use
any encoding at any time, but usually everyone uses the same encoding,
which ***may or may not be the same encoding as the channel name
itself***.

For example, the channel name may be encoded in iso-8859-1 and the
users may agree to use utf-8. ***So it's fundamentally important
to be able to specify a different encoding for privmsgs on the channel
than is used to specify the encoding of the channel name itself.***

THEREFORE,
We're going to have to start thinking about syntax for how to specify
all this stuff on a per-channel, per-server basis. As a wild example,
we could prefix channel names with encoding, using invalid-for-channel
characters.

Example:
/join (iso-8859-1)#frönd
(join the channel, encoding the channel name in iso-8859-1)

/join (utf-8)#frönd
(join the channel, encoding the channel name in utf-8)

/join (iso-8859-1/utf-8)#frönd
(the channel name is encoded in iso-8859-1, but privmsgs will be
encoded in utf-8)
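
To give people something to poke holes in, here is a sketch of how
that wild-example syntax could be parsed and applied. Everything here
is hypothetical: the parse_target name, the regex, and falling back to
iso-8859-1 when no prefix is given are all just placeholders for
discussion.

```python
import codecs
import re

# "(enc)#chan" or "(name_enc/msg_enc)#chan"
SPEC = re.compile(
    r"^\((?P<name_enc>[^()/]+)(?:/(?P<msg_enc>[^()/]+))?\)(?P<channel>.+)$"
)

def parse_target(arg, default="iso-8859-1"):
    """Return (channel, name_encoding, message_encoding) for a /join argument."""
    match = SPEC.match(arg)
    if match is None:
        return arg, default, default      # no prefix: assumed default
    name_enc = match.group("name_enc")
    msg_enc = match.group("msg_enc") or name_enc
    for enc in (name_enc, msg_enc):
        codecs.lookup(enc)                # raises LookupError if unknown
    return match.group("channel"), name_enc, msg_enc

channel, name_enc, msg_enc = parse_target("(iso-8859-1/utf-8)#fr\u00f6nd")

# The channel name always goes out in its own encoding, even in a PRIVMSG.
target = channel.encode(name_enc)
join_line = b"JOIN " + target + b"\r\n"
privmsg = b"PRIVMSG " + target + b" :" + "hall\u00e5".encode(msg_enc) + b"\r\n"
print(join_line)
print(privmsg)
```

Note that a single PRIVMSG line ends up containing two encodings at
once (the target in one, the text in the other), which is the part I
most want people to think about.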

The last thing I want to do is support utf-8 but then end up having
it be half-assed and make everyone think I'm a clod for not thinking
of every last important detail to take care of. So now is the time
to tell me what's really important for supporting a multi-encoding
irc client!

Thanks for your discussion!
Jeremy