console-input-method

Name

console-input-method — provide a pre-edit front-end processor for CJKV (and other) input methods on a user-space virtual terminal

Synopsis

console-input-method [--chinese1] [--chinese2] [--kana] [--hangeul] [--romaji] {uvcname} {lvcname} [chinese1] [chinese2] [hiragana katakana] [hangeul] [romaji]

Description

console-input-method layers an input method over a (user-space) virtual terminal, allowing the input of CJK and other characters, presenting itself as an "upper" (user-space) virtual terminal where a pre-edit front-end is overlain on top of the original "lower" virtual terminal.

It opens the character/attribute buffer file lvcname/display and the input FIFO lvcname/input. These are the back-end interfaces of the user-space virtual terminal being overlain, the "lower" virtual terminal, as detailed in console-terminal-emulator(1). It does not attempt to create these if they do not exist.

It opens the character/attribute buffer file uvcname/display and the input FIFO uvcname/input. These are the back-end interfaces of the resultant user-space virtual terminal after combining the input method, the "upper" virtual terminal. It will create these if they do not exist.

It then enters a loop where it simultaneously:

writes all data received from the upper input FIFO to the lower input FIFO, or composes "pre-edited" input that is passed down to the lower input FIFO; and
renders the contents of the character/attribute buffer file for the lower virtual terminal on the upper virtual terminal's display buffer, merging in the display for the input method when it is active.

At termination, it truncates the display buffer file in uvcname/.

Input event handling

When inactive, the input method passes all input events from the upper input FIFO to the lower input FIFO, filtering out and activating the input method upon receipt of the Henkan, Muhenkan, Katakana/Hiragana/Romaji, Han/Yeong, Hanja, Katakana, Hiragana, Romaji, Hangeul, and Zenkaku/Hankaku keys. What these keys are depends on the realizer and whatever keyboard maps it is using; but conventionally the Henkan, Muhenkan, Katakana/Hiragana/Romaji, and Kanji keys are from the 106/109-key PC keyboard, the Hanja and Han/Yeong keys are from the 103/107-key PC keyboard, and the Hiragana, Katakana, Romaji, Hangeul, and Zenkaku/Hankaku keys (per the USB specification) are from non-PC keyboards.

Note that the key that is apparently a Zenkaku/Hankaku/Kanji key on the 106/109-key PC keyboard is in fact just the "Grave" key from the 101/104-key PC keyboard. To use the input method, a realizer must map that (or another) to the "IM Switch" key input event. The IM Switch key input event does not have a direct correspondence to a PC keyboard key nor a USB-specified key, but exists as an additional internal key event to map the choice of toggle key in a keyboard map. Thus the input method is relieved of the need to know about the current keyboard map(s) used by realizer(s), and altering what key toggles the input method on and off is a matter of changing keyboard maps without reconfiguring or restarting the input method itself.

When active, the input method user interface filters out all input events from the input stream generated by whatever realizer is attached to the "upper" virtual terminal, except for the session switch and consumer key next/previous/task-manager events used by console-multiplexor(1); and the "lower" virtual terminal only sees input events from the pre-edited input.
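The routing rules above can be modelled as a small two-state filter. The following is only an illustrative sketch: the event names (such as "IM_SWITCH" and "SESSION_SWITCH") are hypothetical strings invented for this example, the set of activating keys is a subset, and the real input message encoding is not shown.

```python
# Hypothetical event names; these are NOT the real input message encoding.
ACTIVATING = {"IM_SWITCH", "HENKAN", "MUHENKAN", "HANJA", "HAN_YEONG",
              "HIRAGANA", "KATAKANA", "ROMAJI", "HANGEUL"}
PASSED_WHEN_ACTIVE = {"SESSION_SWITCH", "CONSUMER_NEXT",
                      "CONSUMER_PREVIOUS", "CONSUMER_TASK_MANAGER"}


class InputRouter:
    """Toy model of how events are routed when the input method is on or off."""

    def __init__(self):
        self.active = False
        self.to_lower = []    # events forwarded to the lower input FIFO
        self.to_preedit = []  # events consumed by the pre-edit user interface

    def handle(self, event):
        if not self.active:
            if event in ACTIVATING:
                self.active = True            # filtered out, not forwarded
            else:
                self.to_lower.append(event)   # plain pass-through
        elif event in PASSED_WHEN_ACTIVE:
            self.to_lower.append(event)       # multiplexor events still pass
        elif event == "IM_SWITCH":
            self.active = False               # toggle the user interface off
        else:
            self.to_preedit.append(event)     # everything else is pre-edited
```

In this model the lower terminal never sees the toggle key itself, only ordinary input typed while the input method is off and the session-switching events.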

Input method configuration

The input method is entirely table-driven, controlled by data files containing tables of mappings from the original ASCII composition element sequences to the unconverted and converted characters to display. These files are CIN files, compatible with OpenVanilla, the Chinese Open Desktop, xcin, gcin, hime, OkidoKey.app, and MacOS. console-input-method only uses some of the contents of CIN files; namely the %keep_key_case, %keyname, and %chardef directives and their accompanying data. It also only accepts UTF-8 encoded CIN files.
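As a rough illustration of the three directives that are used, here is a minimal sketch of a reader for UTF-8 CIN text. It is not the parser that console-input-method uses. The function name parse_cin is hypothetical, and the sketch assumes the common CIN conventions that %keyname and %chardef blocks are bracketed by "begin" and "end" lines, and that keys are folded to lower case unless %keep_key_case is present.

```python
def parse_cin(text):
    """Return (keep_key_case, keynames, chardefs) from UTF-8 decoded CIN text.

    Assumes '%keyname begin/end' and '%chardef begin/end' block markers and
    whitespace-separated 'key value' lines; all other directives are ignored.
    """
    keep_case = False
    keynames, chardefs = [], []
    section = None
    for raw in text.splitlines():
        fields = raw.strip().split(None, 1)
        if not fields:
            continue
        if section is None:
            if fields[0] == "%keep_key_case":
                keep_case = True
            elif fields[0] in ("%keyname", "%chardef") and fields[1:] == ["begin"]:
                section = fields[0]
        elif fields[0] == section and fields[1:] == ["end"]:
            section = None
        elif len(fields) == 2:
            (keynames if section == "%keyname" else chardefs).append(tuple(fields))
    # Assumption: without %keep_key_case, keys are matched case-insensitively.
    fold = (lambda k: k) if keep_case else str.lower
    names = {fold(k): v for k, v in keynames}
    defs = {}
    for k, v in chardefs:
        defs.setdefault(fold(k), []).append(v)  # one key may have many conversions
    return keep_case, names, defs
```

The key names give the display form of unconverted input, and each %chardef key maps to a list of candidate conversions in file order.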

The input method operates in one of five modes: chinese, katakana, hiragana, hangeul, and romaji. The chinese and romaji modes have two and three sub-modes, respectively. Each mode is driven by a data table, either a default table that provides a null mapping that does no translation, or one read from a CIN file. The CIN files for these modes (and sub-modes) are named by the chinese1, chinese2, katakana, hiragana, hangeul, and romaji command-line arguments. The romaji table filename command-line argument applies to all romaji sub-modes. For each file to be supplied as a command-line argument, the corresponding --chinese1, --chinese2, --kana, --hangeul, or --romaji command-line option must also be used; otherwise a default table is used in the relevant place(s) and no command-line argument naming a file is expected. The hiragana and katakana data table filename command-line arguments are enabled together by the single --kana command-line option.

These names are placeholders, and do not impose requirements upon what data table is used in what mode. The "chinese" modes here are an umbrella name intended for data tables for the Hanja, Kanji, and Hanzi modes of Korean, Japanese, and Chinese input. There are two "chinese" modes available, permitting the availability of (for example) both simplified and traditional Hanzi. The "romaji" mode is appropriate for non-CJKV input methods from Chinese Open Desktop/OpenVanilla/et al., such as Esperanto or Old English. In "romaji" mode, the conversions provided by the data table are augmented by four hardwired conversions that can be added: to all-lowercase, all-uppercase, title-case, and original raw ASCII. The original raw ASCII can contain spaces, the other conversions eliminate any spaces.
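The four hardwired romaji conversions can be pictured with a short sketch. The function name is hypothetical, and the treatment of title case (capitalizing each space-separated word before the spaces are dropped) is an assumption made for illustration; the text above only states that the three case conversions eliminate spaces and that the raw ASCII keeps them.

```python
def hardwired_romaji_conversions(raw):
    """Illustrative model of the four extra conversions offered in romaji mode."""
    words = raw.split()
    return {
        "lower": "".join(words).lower(),   # spaces eliminated
        "upper": "".join(words).upper(),   # spaces eliminated
        # Assumption: each word is capitalized before the spaces are removed.
        "title": "".join(w.capitalize() for w in words),
        "raw": raw,                        # original ASCII, spaces retained
    }
```

For the input "new york" this model yields "newyork", "NEWYORK", "NewYork", and the unchanged "new york".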

So, for example, one could specify (using CIN filenames from the Chinese Open Desktop):

only needing Pinyin Chinese by using --chinese1 pinyin.cin
only needing Japanese by using --chinese1 --kana nippon.cin katakana.cin hiragana.cin
needing Korean and Esperanto by using --chinese1 --hangeul --romaji hanja.cin hangeul.cin esperanto.cin
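The pairing of options with filename arguments can be sketched as a helper that assembles an argument vector in the order given by the synopsis. The helper name build_argv and the directory names "upper" and "lower" are hypothetical; the two kana filenames are passed through in whatever order the caller supplies them.

```python
def build_argv(uvcname, lvcname, chinese1=None, chinese2=None,
               kana=None, hangeul=None, romaji=None):
    """Assemble a console-input-method command line (illustrative sketch).

    Each table that is supplied adds both its option and its filename(s);
    tables left as None add neither, so the default table is used.
    `kana` is a pair of filenames enabled by the single --kana option.
    """
    flags, files = [], []
    for flag, value in (("--chinese1", chinese1), ("--chinese2", chinese2),
                        ("--kana", kana), ("--hangeul", hangeul),
                        ("--romaji", romaji)):
        if value is None:
            continue
        flags.append(flag)
        if flag == "--kana":
            first, second = value      # exactly two filenames are required
            files += [first, second]
        else:
            files.append(value)
    return ["console-input-method", *flags, uvcname, lvcname, *files]


# Korean and Esperanto, with hypothetical terminal directory names.
argv = build_argv("upper", "lower", chinese1="hanja.cin",
                  hangeul="hangeul.cin", romaji="esperanto.cin")
```

The resulting list could be handed to a process-spawning call; the sketch itself does not run anything.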

Pre-editing

The pre-editing user interface comprises an editing field where the data to send are constructed, and a list of conversion choices. It is placed at the current cursor position of the "lower" virtual terminal, subject to the caveats that it is repositioned if it would otherwise extend off-screen (and such repositioning is possible), and that it is not moved if the cursor is made invisible (as some full-screen applications make it whilst they are redrawing their user interfaces, thus preventing it from briefly flickering all over the screen).

The conversion list is generated on the fly as the original ASCII sequence is entered. At the end of the conversion list, single-headed or double-headed arrows indicate whether scrolling in each direction is possible. When the list of conversions is empty, as it initially is, the arrows are replaced by a character denoting which of the conversion modes is the currently active one.

The editing field displays converted characters up to the current cursor position, or just before it if a convertible sequence has only been partly entered, and unconverted characters from that point onwards. Unconverted characters are displayed using the "key names" specified in the active data table, which are (for example) Hangeul Jamo or column indicators for some Chinese input methods. Converted characters are displayed using the conversions from the same table.

Using a JIS 106/109-key PC keyboard

With a 106/109-key PC keyboard and an appropriate keyboard map, the user interface can be driven with its language keys, mapped to the following input key events:

ひらがな/カタカナ/ローマ字, mapped to the abstract "Hiragana/Katakana/Romaji" key: This key's function depends on the shift level. Level 1 sets hiragana conversion mode, level 2 (⇧ Level2) sets katakana conversion mode, and level 3 (⌥ Option or ⇮ AltGr) sets romaji conversion mode and cycles through the three romaji sub-modes (all lower case, all upper case, and title case).
Note: Alternatively, a keyboard map could map each shift level to the individual abstract "Hiragana", "Katakana", and "Romaji" keys, which do the same things as the abstract "Hiragana/Katakana/Romaji" key in its various shift levels.
漢字 (level 1 or level 2 shift), mapped to the abstract "IM Switch" key: Toggle the user interface on and off.
変換, mapped to the abstract "Henkan" key: Pick the converted text from the conversion list when converting.
無変換, mapped to the abstract "Muhenkan" key: Pick the original raw ASCII from the conversion list when converting.
⌥ Option+漢字 or ⇮ AltGr+漢字 (i.e. level 3 shift), mapped to the abstract "Kanji" key: Set chinese conversion mode, cycling through both chinese sub-modes.

Using a Korean 103/107-key PC keyboard

With a Korean 103/107-key PC keyboard and an appropriate keyboard map, the user interface can be mostly driven with its language keys, mapped to the following input events:

한자, mapped to the abstract "Hanja" key: Set chinese conversion mode, cycling through both chinese sub-modes.
한/영, mapped to the abstract "Han/Yeong" key: This key cycles around the conversion modes, cycling through the three romaji sub-modes (all lower case, all upper case, and title case) one at a time.

The 103/107-key PC keyboard does not have a handy key to map to turning the user interface on and off. The aforegiven keys will of course turn it on; turning it off is done the way that one has to do it with non-JIS non-Korean keyboards.

Using other PC keyboards

For 101/104-key, 102/105-key, and 104/107-key PC keyboards, much of the same functionality is available via control keys.

⎈ Control+@: Turn the user interface off.
⎈ Control+G: Set hangeul conversion mode.
⎈ Control+L: (per IMLIB in OSF/1) Set hiragana conversion mode.
⎈ Control+K: (per IMLIB in OSF/1) Set katakana conversion mode.
⎈ Control+R: Set romaji conversion mode and cycle through the three romaji sub-modes (all lower case, all upper case, and title case).
⎈ Control+Z: Set chinese conversion mode, cycling through both chinese sub-modes.
⎈ Control+C: Switch to the first conversion on the conversion list.
⎈ Control+N: Switch to the raw ASCII on the conversion list.

Using a non-PC Japanese wordprocessing keyboard

With a non-PC Japanese wordprocessing keyboard and an appropriate keyboard map, the user interface can be driven with its language keys, mapped to the following input key events:

Hiragana: Set hiragana conversion mode.
Katakana: Set katakana conversion mode.
半角/全角, mapped to the abstract "Zenkaku/Hankaku" key: This key does nothing, for two reasons. First, all characters are the same width in the user-space console subsystem anyway and there is no meaning to selecting half/full width. Second, it is a logically distinct input event from the IM Switch key.

Common to all keyboards

Other keys are:

⮠ Return, Enter, Execute: (i.e. the keys on the main and numeric keypads, and any "execute" consumer key) Accept whatever is in the data-to-send edit field in its current state and send the appropriate input messages to the "lower" virtual terminal. The pre-edit and conversion list are then reset to empty.
Cursor Down, Cursor Up: Scroll down and up the current list of conversions.
Cursor Left, Cursor Right: Move the cursor back and forth across the editing field; unconverting and reconverting as it moves.
⇱ Home, ⎈ Control+A: (per emacs) Move the cursor to the beginning of the editing field; unconverting everything.
⇲ End, ⎈ Control+E: (per emacs) Move the cursor to the end of the editing field; re-converting everything.
⌦ Delete, Del: (i.e. the keys on the editing and numeric keypads) Delete the character at the cursor position.
⌫ Backspace, BS: (i.e. the keys on the main and, if present, numeric keypads) Delete the character before the cursor position and reconvert.
␣ Space: Mark a division for the converter. The table-based conversion will not convert sequences with spaces in the middle, so this is a way to limit conversion to shorter sequences. Spaces employed this way are not added to the converted character string. Spaces in unconverted data, or resulting from the original raw ASCII conversion in romaji mode, will be sent as-is, however.

console-input-method has no knowledge of the actual keys on input devices that realizers map to these input messages, or to the input messages for the actual raw ASCII to be converted. By the point that user input has reached it, it has already passed through keyboard maps.

This means that the spellings that one types can vary according to keyboard map, which in turn means that (say) "sake" or "pinyin" are typed according to the actual keyboard map in use by a realizer, rather than according to a fixed physical keyboard layout. Put another way: the ASCII spelling of a character displayed on screen is invariant no matter what the keyboard map used by the realizer, but how one physically types that spelling can vary according to QWERTY, AZERTY, Dvorak, et al. layouts.

That is the case for input methods that are based upon spelling character names or pronunciations. Conversely, input methods that are, rather, columnar usually require that the realizer be employing a QWERTY layout, as that is what their columnar conversions assume (with, for example, the Q, A, and Z keys being in the same column on the keyboard).

All conversion is direct from ASCII; in particular Japanese conversion is ASCII-to-kanji not ASCII-to-kana followed by kana-to-kanji. This is why the Space key merely acts as (an optional) conversion divider, and Henkan and Muhenkan have slightly different actions; as there is no intermediate kana stage needing a "conversion" key to progress beyond.

All of the mechanism of conversion, furthermore, is entirely encoded in the data table files. Those are where everything goes, from pseudonyms for symbols to consonant gemination being converted into "little tsu". There is no special conversion knowledge in console-input-method itself.

Security

console-input-method requires no superuser privileges and is designed to be run entirely under the aegis of a dedicated unprivileged user account. It only requires write and search access to uvcname/ and need not have owner access to it. Conversely, only the input-method process needs write access to uvcname/, as it is the only thing expected to create files there.

All created display buffer files have permissions rw-r-----. All created input FIFO files have permissions rw--w----. All display buffer files and the input FIFO file have their group IDs explicitly set to the effective GID of the input-method process. The input method process itself has owner access to these files, and their owner ID is the effective UID of the input-method process.
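The permission scheme can be illustrated with a short sketch that creates the two back-end files the way this paragraph describes: mode 0640 (rw-r-----) for the display buffer, mode 0620 (rw--w----) for the input FIFO, and group ownership set explicitly to the effective GID. The function name is hypothetical and this is not the program's own code; it only uses standard POSIX calls exposed by Python's os module.

```python
import os
import stat


def create_backend_files(vcname):
    """Create vcname/display and vcname/input with the documented permissions."""
    os.makedirs(vcname, exist_ok=True)
    display = os.path.join(vcname, "display")
    fifo = os.path.join(vcname, "input")

    # Display buffer: rw-r----- . Set the mode explicitly afterwards, since
    # the mode given to open() is filtered through the process umask.
    fd = os.open(display, os.O_RDWR | os.O_CREAT, 0o640)
    try:
        os.fchmod(fd, 0o640)
        os.fchown(fd, -1, os.getegid())   # group := effective GID, owner unchanged
    finally:
        os.close(fd)

    # Input FIFO: rw--w---- , so group members can write input but not read it.
    if not os.path.exists(fifo):
        os.mkfifo(fifo, 0o620)
    os.chmod(fifo, 0o620)
    os.chown(fifo, -1, os.getegid())
    return display, fifo
```

Setting the group explicitly matters when the containing directory is set-group-ID, because newly created files would otherwise inherit the directory's group.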

Usually uvcname/ will be set-group-ID to a group different to the effective group ID of the input-method process. Changing the groups of uvcname/input and uvcname/display to the effective GID of the input-method process thus distinguishes group access to those files in particular, allowing one to add ordinary users to the effective GID of the input-method process in order to give them direct realizer access to the upper terminal without (thereby) granting them (group) access to anything else in uvcname/.

Truncating the display buffer file at (non-abend) termination ensures that (absent system backups, log-structured filesystems, and low-level data recovery) old terminal display content cannot be read out of a display buffer. For best results, place these files on a temporary filesystem, set whatever options the temporary filesystem has (if any) for erasing backing storage at unmount, and exclude the temporary filesystem from backups.

Author

Jonathan de Boyne Pollard