Discussion:
[hebrew] Hebrew Issues
Peter Kirk
2003-08-23 22:13:43 UTC
Dear colleagues,
I have posted at http://www.qsm.co.il/Hebrew/Hebrew%20Issues.htm a draft of
my summary of the several issues concerning Hebrew that have been discussed
lately.
Regards,
Jony
Thank you, Jony. This seems to me a fair overview of the issues. I
comment below on specific points, interleaved in a snipped plain text
copy of your draft.
Hebrew Issues - DRAFT
23, 2003.
1. Background
Recently, the Unicode list has been active with discussions of
problems relating to Hebrew in general and to Biblical Hebrew in
particular.
I suggest that before any solutions are devised or any changes to
Unicode proposed, a comprehensive list of all the issues should be
prepared. This is my draft.
In the following text, the word /Bible/ refers to the Hebrew book that
is also known as the Old Testament. The term /marks/ includes
cantillation marks and other Hebrew marks.
Perhaps it should be clarified here that Unicode distinguishes by their
names three types of combining character in the Hebrew block:

1. POINTS - vowel points, dagesh, meteg, rafe, shin and sin dots.
2. ACCENTS - this is the Unicode name for cantillation marks.
3. MARKS - masora circle, upper dot.

I assume by "marks" you refer to groups 2 and 3 as you refer separately
to "points".
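As a rough illustration of this three-way naming split, here is a minimal sketch that groups the combining characters of the Hebrew block by the second word of their character names. It assumes Python's standard unicodedata module, which is not mentioned in this thread, and the function name is purely illustrative. The exact membership of each group depends on the Unicode version bundled with the interpreter.

```python
import unicodedata
from collections import defaultdict

def hebrew_combining_by_type():
    """Group Hebrew-block nonspacing marks by the second word of their name."""
    groups = defaultdict(list)
    for cp in range(0x0590, 0x0600):
        ch = chr(cp)
        name = unicodedata.name(ch, "")  # "" for unassigned code points
        # Keep only nonspacing marks; letters and punctuation are skipped.
        if unicodedata.category(ch) != "Mn" or not name.startswith("HEBREW "):
            continue
        kind = name.split()[1]  # "POINT", "ACCENT" or "MARK"
        groups[kind].append(f"U+{cp:04X} {name}")
    return dict(groups)

groups = hebrew_combining_by_type()
for kind, members in sorted(groups.items()):
    print(kind, len(members))
    for member in members:
        print("   ", member)
```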
...
1.2 Manuscripts
A manuscript, by its very nature, is different from a printed book.
The scribe draws by hand each letter and mark. Being human, he
sometimes makes mistakes, he sometimes has his own preferences, his
pen sometimes slips, and generally the outcome is significantly less
uniform than a printed book.
Biblical scholars are endeavoring to encode such ancient manuscripts,
with all their variances, with great precision. This occasionally
demands precise control over the location of the marks, beyond that
which may be achieved by Unicode and beyond the scope of plain text
encoding. While I appreciate and support the efforts of Biblical
scholars to achieve an electronic replication of manuscripts, I
believe that some of the issues that have been raised should be
resolved by higher level protocols such as mark-up.
I think here we need to distinguish between systematic and accidental
variances. I agree that accidental variances should not be encoded in
Unicode, and that systematic ones where there is contextual consistency
should be handled by fonts etc rather than Unicode. I have been looking
primarily at differences found not in manuscripts but in scholarly
printed editions, in which the accidental variability of manuscripts has
been eliminated and the systematic distinctions have been represented
according to a consensus of respected textual scholars. I have been
concerned to identify the systematic distinctions which these scholars
have considered significant enough to distinguish in the printed text,
and to suggest means of encoding and rendering these systematic
distinctions in Unicode. See my document referred to below.
1.3 Positioning of Hebrew Points and Marks
...
The first two paragraphs are correct, and it is a pity they were not
left alone. I don't think it is the business of Unicode to specify
these complex typographic rules. But since we started with it, we have
to address a number of exceptions.
Agreed. Or we could omit this section from the next version of the
standard and move its content, expanded, into a technical report or note,
where the exceptions can be addressed in detail.
2. The Issues
2.1 Vav Holam
...
The result is an interchange incompatibility problem. This is a plain
text issue, and should be addressed by the UTC.
Agreed.
2.2 Holam Alef
A related problem has been raised concerning the Holam Haser followed
by the letter Alef. Often, the Holam point is printed above the right
hand side of the Alef. It is shifted from the top left of the
preceding (to the right) letter to the top right of the Alef as a
typographical convention. This is normally done when the Alef is not
pronounced.
Although the rules concerning this case are fairly straightforward,
the rendering engine should not need to know so much grammar.
I'm a little surprised, Jony, that you came to this conclusion. It seems
to me that this one is a rendering issue. You have argued before that in
most typesetting this shift is not made. It has been demonstrated (in
Ezra SIL and SBL Hebrew with Uniscribe) that it is feasible for a
rendering engine to implement these rules, in the cases where this shift
is required for high quality e.g. biblical publications. The biblical
text already contains sufficient information to guide the rendering
engine, except possibly for a few special cases, and in the spirit of
"thou shalt not add thereto" I prefer not to do so when, as here, it is
not absolutely necessary.
A possible solution is to use ZWJ to indicate the shifting of the
Holam forward. For example, Bet Dagesh Holam ZWJ Alef.
Agreed, if a mechanism is required. My preference is to use this
encoding only for special cases where the shift takes place as an
exception to the regular rules, and to use ZWNJ instead of ZWJ to
inhibit such shifting in cases where it is not required.

By the way, your example is not in canonical order (although it is in
logical order, see my comments on 2.8 below), and will be reordered to
<bet, holam, dagesh, ZWJ, alef>.
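The reordering described here can be checked with a short sketch. It assumes Python's standard unicodedata module (my choice for illustration, not something used in this thread) and applies NFD normalization to the sequence as typed.

```python
import unicodedata

BET, DAGESH, HOLAM, ZWJ, ALEF = "\u05D1", "\u05BC", "\u05B9", "\u200D", "\u05D0"

def show(text):
    """Render a string as a list of code points for easy comparison."""
    return " ".join(f"U+{ord(c):04X}" for c in text)

# Logical (typing) order: consonant, dagesh, vowel, then ZWJ and the alef.
typed = BET + DAGESH + HOLAM + ZWJ + ALEF
# Canonical ordering sorts adjacent combining marks by combining class.
normalized = unicodedata.normalize("NFD", typed)

print("typed:     ", show(typed))
print("normalized:", show(normalized))
```

Holam has a lower combining class than dagesh, so it sorts first. ZWJ has combining class zero and stays where it was.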
2.3 Grammar Books
In grammar books and other texts discussing the Hebrew script there
may arise a need to render various marks in isolation, without a
visible base character.
I understand that Unicode does provide a solution, as this problem is
not unique to Hebrew. However, since the suggested invisible base
character is not an RTL character, it has neutral directionality, and
an RLM may be needed.
Agreed. Use of RLM has the extra advantage of inhibiting unwanted
contraction of multiple spaces by higher level protocols.
2.4 Private Use Area
The private use area characters, which are not defined by Unicode in
any other way, are defined to have left-to-right directionality. This
prevents their use in Hebrew and Arabic.
I suggest that a small area, either in the PUA block or somewhere
else, be defined as an RTL PUA.
Good idea! Or would it be adequate to suggest that RLM be inserted
before each PUA character? Would that make them right-to-left?
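For reference, the per-character properties in question can be looked up as follows. This is a minimal sketch assuming Python's standard unicodedata module, and U+E000 is just an arbitrary sample from the PUA. It reports only the properties stored for each character. It does not run the bidirectional algorithm, so it does not by itself settle whether a preceding RLM gives the desired layout.

```python
import unicodedata

PUA_SAMPLE = "\uE000"  # arbitrary sample: first code point of the BMP private use area
RLM = "\u200F"         # RIGHT-TO-LEFT MARK

pua_category = unicodedata.category(PUA_SAMPLE)   # general category
pua_bidi = unicodedata.bidirectional(PUA_SAMPLE)  # bidi class of the PUA character
rlm_bidi = unicodedata.bidirectional(RLM)         # bidi class of RLM

print("PUA category:", pua_category)
print("PUA bidi class:", pua_bidi)
print("RLM bidi class:", rlm_bidi)
```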
2.5 Qere and Ketiv, Yerushala(y)im
...
In general, mark-up should be used to provide two alternative texts. I
don't believe it is possible or reasonable to computerize all the
possibilities that are afforded the scribe when he manually places the
points and marks of the Qere on a shorter Ketiv.
I think this is reasonable. At least Unicode fonts should not be
expected to render such things correctly. But I can see that some will
want to try to encode the mixed text form as it appears on the page. One
way to do so would be to use the sequence <RLM, NBSP> as a base
character around which the Qere points and marks can be arranged. (RLM
is necessary here to ensure correct directionality.) But Unicode should
not expect to guarantee correct rendering. And there is no need to
specify this in the standard.
For simpler cases, such as Yerushala(y)im, a zero width invisible base
character could be used. Various possibilities had been discussed. CGJ
is not appropriate because it is not a base character. ZWNBSP would
have been suitable, except that it has been taken over by the BOM.
I fail to see a good reason not to use CGJ in such a case. The Unicode
distinction between a base character and a combining character is a
technical one which does not need to align perfectly with every user's
perceptions.

The exceptional case in Exodus 20:4 of two points under one base
character where there is no omitted letter can also be dealt with well
using CGJ.
2.6 Furtive Patah
In many cases, a Patah vowel under a final Het, Alef or He is
pronounced before them, and this is indicated in fine printing by a
slight shift of the Patah to the right.
Since the rules to distinguish the Furtive are simple and
straightforward, i.e. this is a straightforward case of rendering, it
was decided at the SII that a special character is not needed.
Agreed. This is a rendering issue.
2.7 Meteg and Siluq
Unicode, following the SII, has unified the Meteg and the Siluq
because they look the same and are easy to distinguish, as Siluq
always appears before a Sof Pasuq.
The standard position of both the Meteg and the Siluq is to the left
of the vowel. In some cases the Meteg is written on the right hand side
of the vowel. With Hataf vowels, some printers place the Meteg in the
middle of the Hataf.
Not just printers, this appears in MSS as well.
In some editions, the Meteg on the right indicates it was added by the
editor and does not appear in the manuscript.
But in other cases it does appear in the manuscript. BHS, the standard
scholarly edition in western countries, follows the Leningrad codex in
meteg positioning. See for example the attached from this codex, Genesis
8:6 (taken from http://www.moses.uklinux.net/sample/b19a-preview.pdf).
There are also several right metegs visible in the extract from a Lisbon
codex of 1492 at http://www.moses.uklinux.net/sample/lisbon-sample.pdf,
in the repeated "and there was evening and there was morning" in Genesis
1 - interestingly, more of them than there are in BHS, but not all
metegs are to the right.
The medial Meteg in the Hataf vowels could be a rendering issue, a
combining marks ligature. However, in this case we would need a CGNJ
when a left Meteg is needed together with a Hataf.
In the absence of a CGNJ, and since CGJ does not have defined joining
properties despite its misleading name, I have suggested using CGJ for this.
For the right Meteg, a new character is needed. Whether it should be
in the PUA or a general use Unicode is open. A private convention by
the editor of a single book, however important, indicates the PUA. If
other uses are common, then it could be a Unicode character.
This is not a matter of a single book. I have identified three Bible
editions (BHS, BHK, and Baer as reported by GKC (Gesenius, Kautzsch,
Cowley) 16g) and two manuscripts which use right meteg as a distinctive
positioning. Anyway, I would have concerns about the principle "A
private convention by the editor of a single book, however important,
indicates the PUA" in a case where electronic editions of this book are
expected to be used and quoted by a worldwide community of thousands and
extensively on the Internet, in domains where interchange of PUA
characters has not been agreed.

But I disagree that a new character is needed. This is essentially an
alternative positioning of the same combining character relative to
other combining characters with which it interferes typographically.
This should have been dealt with by appropriate allocation of combining
classes. As it was not, the appropriate mechanism seems to be to use CGJ
to inhibit canonical reordering. Thus my suggestion (= indicates
canonical equivalence):

left meteg (non-hataf vowel): <vowel, meteg> = <meteg, vowel>
right meteg: <meteg, CGJ, vowel>
medial meteg (hataf vowel): <vowel, meteg> = <meteg, vowel>
left meteg (hataf vowel): <vowel, CGJ, meteg>
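The four sequences above can be tried out in a short sketch, assuming Python's standard unicodedata module and using patah and hataf patah as sample vowels on a bet (all of these choices are mine, for illustration only). It checks only what normalization does to each sequence. How a font renders them is a separate question.

```python
import unicodedata

BET, METEG, CGJ = "\u05D1", "\u05BD", "\u034F"
PATAH, HATAF_PATAH = "\u05B7", "\u05B2"

def nfd(text):
    """Apply canonical decomposition and canonical reordering."""
    return unicodedata.normalize("NFD", text)

# Without CGJ both typing orders normalize to the same sequence.
vowel_first = nfd(BET + PATAH + METEG)
meteg_first = nfd(BET + METEG + PATAH)

# CGJ has combining class 0, so marks cannot be reordered across it.
right_meteg = nfd(BET + METEG + CGJ + PATAH)
hataf_left_meteg = nfd(BET + HATAF_PATAH + CGJ + METEG)

print(vowel_first == meteg_first)                   # canonically equivalent
print(right_meteg == BET + METEG + CGJ + PATAH)     # order preserved
print(hataf_left_meteg != nfd(BET + HATAF_PATAH + METEG))  # distinct from the plain sequence
```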
2.8 Combining Classes
When a Hebrew text is normalized according to Unicode normalization
rules, the combining marks are not ordered according to the
convenience of some rendering engines.
It has been stated, however, that this is not the purpose of the
combining classes, and that the rendering engine should, in this case,
reorder the combining marks according to its preferences as part of
the rendering process.
Agreed, but this is only part of the story. Different combining classes
have been assigned to points which do interfere typographically, and
this is causing several problems. Also the canonical ordering is
illogical e.g. consonant modifiers (sin/shin dot, dagesh, rafe) are in
canonical order separated from the consonants they modify by the vowels
which logically follow; it is not the order used instinctively in typing
or by Jony in writing the example in 2.2 above. This causes problems
with collation which can only be fixed by defining hundreds of contractions.

But I understand that it is not possible to fix the errors which were
originally made in defining these classes.
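To make the ordering concrete, the following minimal sketch prints the combining class of a few Hebrew points and shows a shin dot being separated from its consonant by a vowel. It assumes Python's standard unicodedata module, and the selection of points is only a sample.

```python
import unicodedata

# A sample of points: consonant modifiers first, then some vowels and meteg.
SAMPLE_POINTS = ["\u05C1", "\u05C2", "\u05BC", "\u05BF",
                 "\u05B0", "\u05B7", "\u05B8", "\u05B9", "\u05BD"]

for point in SAMPLE_POINTS:
    print(f"U+{ord(point):04X} ccc={unicodedata.combining(point):3d} "
          f"{unicodedata.name(point)}")

SHIN, SHIN_DOT, QAMATS = "\u05E9", "\u05C1", "\u05B8"

# Typing order: the consonant, its modifier, then the vowel.
typed = SHIN + SHIN_DOT + QAMATS
# In canonical order the vowel comes between the consonant and its shin dot.
canonical = unicodedata.normalize("NFD", typed)

print([f"U+{ord(c):04X}" for c in canonical])
```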
2.9 Inverted Nun
In the Bible there are a few cases of a special mark known as
"Inverted Nun". It is probably not an inverted letter Nun, and
requires its own character, HEBREW MARK INVERTED NUN.
Agreed.
2.10 Extraordinary Points
The SII encoded only the upper extraordinary point, as 05C4 HEBREW
MARK UPPER DOT. A character for the lower dot could be added, although
it appears only a few times.
Agreed. Although this latter character is rare, it is in regular and
undisputed use in a widely used text, and so probably does need to be
encoded.
2.11 Broken Letters
There are in the text of the Bible a few instances of the mutilated or
broken letters Vav and Qof. I suggest this could be handled by mark-up.
Perhaps. The problem is that known mark-up languages have as far as I
know no mechanisms for handling requests for variant glyphs. But Unicode
does have such a mechanism, variation selectors. This could be a case
where it would be suitable to use them.
2.12 Number Dots
An old practice was to use dots and double dots above to distinguish
"non words", such as numbers and acronyms. For several centuries this
usage has been replaced by the use of Geresh and Gershayim.
The dots always appear on unpointed texts. There is nothing special
about them, so the existing Unicodes 0307 and 0308 could be used.
Agreed.
2.13 Shva Na vs. Shva Nah
The Hebrew vowel Shva has two meanings, known as Shva Na and Shva Nah.
Some printers desire to make the difference visible.
This is analogous to similar issues in other languages, for example
the dual meaning of s in the English word summers, and should be
handled by mark-up.
It seems to me that this is more analogous to the diacritics added to
English words in some dictionaries etc to indicate and disambiguate
their pronunciation, which can be encoded in Unicode. And again this is
not something which any known mark-up can handle. So, at least if this
is at all a regular practice and the glyphs used are at all
standardised, a good case can be made for encoding a second separate
combining character here, or possibly using a variation selector. If it
is not at all standardised, "A private convention by the editor of a
single book ... indicates the PUA."
2.14 Qamats Gadol vs. Qamats Qatan
The Hebrew vowel Qamats has two meanings, known as Qamats Gadol and
Qamats Qatan. Some printers desire to make the difference visible.
This is analogous to similar issues in other languages, for example
the dual meaning of s in the English word summers, and should be
handled by mark-up.
Same comment as on 2.13.
2.15 Vav with Dagesh vs. Shuruq
The Hebrew vowel Shuruq looks exactly like a Vav with Dagesh. Unicode,
following the SII, unified them.
Some people want to see a code for the Vav Shuruq, considering it a
separate vowel. Since there is no known typographical difference I see
no reason to do so.
I agree. But according to GKC p.55 note 2 there should actually be a
typographical difference: "/Wāw/ with /Dageš/ (וּ) cannot in our printed
texts be distinguished from /wāw/ pointed as /Šûrĕq/ (וּ); in the latter
case the point should stand higher up."
2.16 Hiriq Male
A vowel Hiriq followed by a silent Yod is called Hiriq Male.
Some people want to see a code for Hiriq Male, considering it a
separate vowel. Since there is no known typographical difference I see
no reason to do so.
Agreed.
3. References
Issues in the Representation of Pointed Hebrew in Unicode, Second
draft, Peter Kirk, August 2003,
http://www.qaya.org/academic/hebrew/Issues-Hebrew-Unicode.doc
http://www.qaya.org/academic/hebrew/Issues-Hebrew-Unicode.html.
I intend to make some minor updates to this, which I will post at the
same location and perhaps also as a PDF.
...
You have not mentioned the following issue which I identified - the
2.6. Punctuation issues
Certain Hebrew punctuation marks are not correctly described in
Unicode 4.0.
/Sof pasuq/ is used to indicate the end of a verse in the Hebrew Bible
(although it is missing from the end of a few verses in some texts,
and completely absent from some others) and as the equivalent of a
full stop in other Hebrew writings such as prayer books. It should be
classed and processed as Terminal_Punctuation and also as a character
which typically terminates a sentence.
/Paseq/ is also used only at the ends of words, and so should also be
classed as Terminal_Punctuation, but not as terminating a sentence.
/Paseq/ has two uses, one as part of the Hebrew accent system and the
other as a special textual mark in the Hebrew Bible; it is normally
found only in the Hebrew Bible and in quotations from it.
/Maqaf/ is also generally considered to be a word divider and so
should also be classed as Terminal_Punctuation. As its usage is
analogous to that of /hyphen/ and line breaks commonly occur after it
in pointed Hebrew texts, it should also be listed in Unicode Standard
Annex #14, along with /hyphen/, as a “break opportunity after”.
I hope the above will help you in revising and completing your draft
document.
--
Peter Kirk
***@qaya.org (personal)
***@qaya.org (work)
http://www.qaya.org/



To Unsubscribe, send a blank message to: unicode-***@yahooGroups.com

This mailing list is just an archive. The instructions to join the true Unicode List are on http://www.unicode.org/unicode/consortium/distlist.html


Your use of Yahoo! Groups is subject to http://docs.yahoo.com/info/terms/
John Hudson
2003-08-24 17:56:55 UTC
[Bcc'd to the SBL BibLit project discussion list.]
Post by Peter Kirk
2.2 Holam Alef
...
Post by Peter Kirk
Although the rules concerning this case are fairly straightforward, the
rendering engine should not need to know so much grammar.
I'm a little surprised, Jony, that you came to this conclusion. It seems
to me that this one is a rendering issue. You have argued before that in
most typesetting this shift is not made. It has been demonstrated (in Ezra
SIL and SBL Hebrew with Uniscribe) that it is feasible for a rendering
engine to implement these rules, in the cases where this shift is required
for high quality e.g. biblical publications. The biblical text already
contains sufficient information to guide the rendering engine, except
possibly for a few special cases, and in the spirit of "thou shalt not add
thereto" I prefer not to do so when, as here, it is not absolutely necessary.
I agree with Peter, it is not a problem for the rendering (in this case
font lookups) to handle this holam repositioning contextually.
Post by Peter Kirk
A possible solution is to use ZWJ to indicate the shifting of the Holam
forward. For example, Bet Dagesh Holam ZWJ Alef.
Agreed, if a mechanism is required. My preference is to use this encoding
only for special cases where the shift takes place as an exception to the
regular rules, and to use ZWNJ instead of ZWJ to inhibit such shifting in
cases where it is not required.
Again, I agree:

<bet, dagesh, holam, alef> = holam repositioned on alef

<bet, dagesh, holam, ZWNJ, alef> = holam retained on bet
Post by Peter Kirk
By the way, your example is not in canonical order (although it is in
logical order, see my comments on 2.8 below), and will be reordered to
<bet, holam, dagesh, ZWJ, alef>.
Thankfully, this is one of the mark reordering cases that the font lookups
can handle: we just need to make sure that the context is large enough for
other marks to fall between the holam and the alef. However, this does
raise the question of what happens to the ZWNJ in reordering

<bet, dagesh, holam, ZWNJ, alef>

If the holam ends up reordered before the dagesh, where does the ZWNJ end
up? If it remains immediately in front of the alef, that's fine.
Post by Peter Kirk
For simpler cases, such as Yerushala(y)im, a zero width invisible base
character could be used. Various possibilities had been discussed. CGJ is
not appropriate because it is not a base character. ZWNBSP would have
been suitable, except that it has been taken over by the BOM.
I fail to see a good reason not to use CGJ in such a case. The Unicode
distinction between a base character and a combining character is a
technical one which does not need to align perfectly with every user's
perceptions.
I agree. I understand the logic in inserting an invisible base character in
a place where readers 'know' there is a missing consonant, but the
consonant *is* missing, it is not there and should not be there. CGJ works
fine in this instance, because the only important thing to do is to make
sure that the two vowels are not reordered.
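A minimal sketch of that reordering point, assuming Python's standard unicodedata module and assuming, for illustration, that the two vowels under the lamed are patah followed by hiriq:

```python
import unicodedata

LAMED, PATAH, HIRIQ, FINAL_MEM, CGJ = "\u05DC", "\u05B7", "\u05B4", "\u05DD", "\u034F"

def nfd(text):
    """Apply canonical decomposition and canonical reordering."""
    return unicodedata.normalize("NFD", text)

# Without CGJ, hiriq (lower combining class) is sorted before patah,
# so the intended reading order of the two vowels is lost.
without_cgj = nfd(LAMED + PATAH + HIRIQ + FINAL_MEM)

# With CGJ between them, the two vowels keep their order.
with_cgj = nfd(LAMED + PATAH + CGJ + HIRIQ + FINAL_MEM)

print([f"U+{ord(c):04X}" for c in without_cgj])
print([f"U+{ord(c):04X}" for c in with_cgj])
```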
Post by Peter Kirk
The medial Meteg in the Hataf vowels could be a rendering issue, a
combining marks ligature. However, in this case we would need a CGNJ when
a left Meteg is needed together with a Hataf.
In the absence of a CGNJ, and since CGJ does not have defined joining
properties despite its misleading name, I have suggested using CGJ for this.
Since actual glyph ligation is occurring, the ZWNJ should be used to inhibit
ligation. This is consistent with the Unicode 4.0 description of ZWJ and
ZWNJ behaviour. A question remains, however: should medial meteg with hataf
be the default rendering of <hataf..., meteg>, or should such ligation
require <hataf..., ZWJ, meteg>? This is a rendering issue, but one which
affects encoding: if one set of fonts treats ligation as default and
another set doesn't, users will produce documents with conflicting encoding
conventions depending on the rendering of the fonts they are using (one can
even imagine a single document, set in multiple fonts, using different
character sequences to obtain the same rendering). Personally, I favour
having the medial meteg as default rendering for <hataf..., meteg>,
requiring <hataf..., ZWNJ, meteg> in order to obtain a left meteg, because
the medial meteg appears to be the most common positioning in the
manuscript tradition.
Post by Peter Kirk
For the right Meteg, a new character is needed.
...
Post by Peter Kirk
But I disagree that a new character is needed. This is essentially an
alternative positioning of the same combining character relative to other
combining characters with which it interferes typographically. This should
have been dealt with by appropriate allocation of combining classes. As it
was not, the appropriate mechanism seems to be to use CGJ to inhibit
canonical reordering. Thus my suggestion (= indicates canonical equivalence):
left meteg (non-hataf vowel): <vowel, meteg> = <meteg, vowel>
right meteg: <meteg, CGJ, vowel>
medial meteg (hataf vowel): <vowel, meteg> = <meteg, vowel>
left meteg (hataf vowel): <vowel, CGJ, meteg>
I basically agree, with the following modification:

left meteg (hataf vowel): <vowel, ZWNJ, meteg>

Does this mean that we are agreed that the medial meteg rendering should be
normative?
Post by Peter Kirk
2.9 Inverted Nun
In the Bible there are a few cases of a special mark known as "Inverted
Nun". It is probably not an inverted letter Nun, and requires its own
character, HEBREW MARK INVERTED NUN.
Agreed.
Agreed. Who wants to write the proposal? I have some good graphics showing
various manuscript forms of this letter, clearly distinguished in form from
the nun.
Post by Peter Kirk
2.10 Extraordinary Points
The SII encoded only the upper extraordinary point, as 05C4 HEBREW MARK
UPPER DOT. A character for the lower dot could be added, although it
appears only a few times.
Agreed. Although this latter character is rare, it is in regular and
undisputed use in a widely used text, and so probably does need to be encoded.
I am content either to have the lower punctum encoded or to use a generic
combining mark (U+0323), although the latter raises issues for multiscript
fonts in applications that do not support writing system-specific glyph
substitution (currently all applications). What I am most keen to have is a
clear statement from the UTC identifying 05C4 HEBREW MARK UPPER DOT as the
upper punctum, as Jony indicates was intended by SII, and specifying a
codepoint for the Hebrew number / masoretic note dot, which requires its
own glyph and cannot be harmonised with the upper punctum character. Again,
this could mean a new Hebrew block character or U+0307 could be used.

Note that until Jony's note on SII's intent, I had presumed U+05C4 to be
the number / masoretic note dot, because of the absence of a corresponding
lower mark to indicate that it was the upper punctum. Now I would like a
definitive ruling from the UTC, to avoid future confusion.
Post by Peter Kirk
2.12 Number Dots
An old practice was to use dots and double dots above to distinguish "non
words", such as numbers and acronyms. For several centuries this usage
has been replaced by the use of Geresh and Gershayim.
The dots always appear on unpointed texts. There is nothing special about
them, so the existing Unicodes 0307 and 0308 could be used.
Agreed.
Okay, that's fine with me, but I'd still like to see a note in the standard
re. U+05C4.


John Hudson

Tiro Typeworks www.tiro.com
Vancouver, BC ***@tiro.com

The sight of James Cox from the BBC's World at One,
interviewing Robin Oakley, CNN's man in Europe,
surrounded by a scrum of furiously scribbling print
journalists will stand for some time as the apogee of
media cannibalism.
- Emma Brockes, at the EU summit



Peter Kirk
2003-08-24 21:36:00 UTC
... However, this does raise the question of what happens to the ZWNJ
in reordering
<bet, dagesh, holam, ZWNJ, alef>
If the holam ends up reordered before the dagesh, where does the ZWNJ
end up? If it remains immediately in front of the alef, that's fine.
ZWNJ is not a combining character and so is unaffected by canonical
reordering. Combining characters can never move from before it to after
it or vice versa. Although CGJ is a combining character, it has the same
effect on ordering as ZWNJ as its combining class is zero.
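These properties can be inspected with a short sketch, assuming Python's standard unicodedata module (not something used in this thread). It prints the general category and combining class of ZWNJ, ZWJ and CGJ, then normalizes the sequence in question to see where the ZWNJ ends up.

```python
import unicodedata

ZWNJ, ZWJ, CGJ = "\u200C", "\u200D", "\u034F"
BET, DAGESH, HOLAM, ALEF = "\u05D1", "\u05BC", "\u05B9", "\u05D0"

# General category and canonical combining class of each control character.
properties = {
    label: (unicodedata.category(ch), unicodedata.combining(ch))
    for label, ch in [("ZWNJ", ZWNJ), ("ZWJ", ZWJ), ("CGJ", CGJ)]
}
for label, (category, ccc) in properties.items():
    print(label, category, ccc)

# Holam and dagesh swap, but nothing moves across the ZWNJ.
normalized = unicodedata.normalize("NFD", BET + DAGESH + HOLAM + ZWNJ + ALEF)
print([f"U+{ord(c):04X}" for c in normalized])
```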
...
Post by Peter Kirk
In the absence of a CGNJ, and since CGJ does not have defined joining
properties despite its misleading name, I have suggested using CGJ for this.
Since actual glyph ligation is occurring, the ZWNJ should be used to
inhibit ligation. This is consistent with the Unicode 4.0 description
of ZWJ and ZWNJ behaviour. ...
But this is where the problem comes. Because ZWJ and ZWNJ are not
combining characters, they (theoretically, though not necessarily in
your implementation) break the combining character sequence and so the
link between the combining characters which follow it and the base
character. In fact the following combining characters become a defective
combining sequence whose rendering is undefined. I think MS Word
currently inserts a dotted circle in this case, and this is conformant
behaviour in the case of a defective combining sequence.

Is this correct, anyone, or am I overstating my case? Actually ZWJ is
theoretically less of a problem because it does specify a ligature
between the preceding and following combining character sequences. But
ZWNJ specifies that they should be rendered separately.
... A question remains, however: should medial meteg with hataf be the
default rendering of <hataf..., meteg>, or should such ligation
require <hataf..., ZWJ, meteg>? This is a rendering issue, but one
which affects encoding: if one set of fonts treats ligation as default
and another set doesn't, users will produce documents with conflicting
encoding conventions depending on the rendering of the fonts they are
using (one can even imagine a single document, set in multiple fonts,
using different character sequences to obtain the same rendering).
Personally, I favour having the medial meteg as default rendering for
<hataf..., meteg>, requiring <hataf..., ZWNJ, meteg> in order to
obtain a left meteg, because the medial meteg appears to be the most
common positioning in the manuscript tradition.
If we do use ZWJ/ZWNJ, and based on the principle in the standard (TUS
4.0 pp. 389-390) "These characters are not to be used in all cases where
ligatures or cursive connections are desired; instead, they are only for
overriding the normal behavior of the text", I would suggest that
<hataf, meteg> should
be rendered according to the font default which may vary (medial for a
font based on BHS, left meteg for a font based on an edition in which
this is the default); <hataf, ZWJ, meteg> should be used to prefer
medial despite the default (not sure if this is ever required); and
<hataf, ZWNJ, meteg> to inhibit medial when this must not be used (as
in a few cases in BHS).
...
Post by Peter Kirk
left meteg (non-hataf vowel): <vowel, meteg> = <meteg, vowel>
right meteg: <meteg, CGJ, vowel>
medial meteg (hataf vowel): <vowel, meteg> = <meteg, vowel>
left meteg (hataf vowel): <vowel, CGJ, meteg>
left meteg (hataf vowel): <vowel, ZWNJ, meteg>
See the reasons above for not using this.
Does this mean that we are agreed that the medial meteg rendering
should be normative?
I am not intending to say that. I want to say that it can be the default
for a particular font or perhaps a font level attribute. Other fonts
might have left meteg as the default with hatafs and no medial meteg
glyphs; in that case the CGJ or ZWNJ would be ignored. Or they might
have left meteg as the default but also have medial meteg glyphs, in
which case a different mechanism would be required to request use of the
medial meteg, perhaps with ZWJ.

So here is a more nuanced version of my suggestion:

left meteg (non-hataf vowel): <vowel, meteg> = <meteg, vowel>
right meteg: <meteg, CGJ, vowel>
font's default position of meteg (hataf vowel): <vowel, meteg> = <meteg, vowel>
medial meteg (hataf vowel) (if supported by the font): TBD (<vowel, ZWJ, meteg> ???)
left meteg (hataf vowel): <vowel, CGJ, meteg>
...
Post by Peter Kirk
2.10 Extraordinary Points
The SII encoded only the upper extraordinary point, as 05C4 HEBREW
MARK UPPER DOT. A character for the lower dot could be added,
although it appears only a few times.
Agreed. Although this latter character is rare, it is in regular and
undisputed use in a widely used text, and so probably does need to be encoded.
I am content either to have the lower punctum encoded or to use a
generic combining mark (U+0323), although the latter raises issues for
multiscript fonts in applications that do not support writing
system-specific glyph substitution (currently all applications). ...
Presumably a font could be programmed to substitute a glyph based on
context, especially for a combining mark where it would be relatively
simple to determine that the base character is in the Hebrew block and
so the Hebrew glyph variant is required. No help of course if you want
an isolated diacritic or a Qere without Ketiv form.
... What I am most keen to have is a clear statement from the UTC
identifying 05C4 HEBREW MARK UPPER DOT as the upper punctum, as Jony
indicates was intended by SII, and specifying a codepoint for the
Hebrew number / masoretic note dot, which requires its own glyph and
cannot be harmonised with the upper punctum character. Again, this
could mean a new Hebrew block character or U+0307 could be used.
Note that until Jony's note on SII's intent, I had presumed U+05C4 to
be the number / masoretic note dot, because of the absence of a
corresponding lower mark to indicate that it was the upper punctum.
Now I would like a definitive ruling from the UTC, to avoid future
confusion.
Agreed. Notes should be added to the code charts for U+05C4, e.g. "=
upper punctum extraordinarium", and for U+0307 e.g. "= Hebrew number
dot", each with pointers to the other.
--
Peter Kirk
***@qaya.org (personal)
***@qaya.org (work)
http://www.qaya.org/





To Unsubscribe, send a blank message to: unicode-***@yahooGroups.com

This mailing list is just an archive. The instructions to join the true Unicode List are on http://www.unicode.org/unicode/consortium/distlist.html


Your use of Yahoo! Groups is subject to http://docs.yahoo.com/info/terms/
John Hudson
2003-08-25 00:22:40 UTC
Permalink
Post by Peter Kirk
Post by John Hudson
Since actual glyph ligation is occurring, the ZWNJ should be used to
inhibit ligation. This is consistent with the Unicode 4.0 description of
ZWJ and ZWNJ behaviour. ...
But this is where the problem comes. Because ZWJ and ZWNJ are not
combining characters, they (theoretically, though not necessarily in your
implementation) break the combining character sequence and so the link
between the combining characters which follow it and the base character.
In fact the following combining characters become a defective combining
sequence whose rendering is undefined. I think MS Word currently inserts a
dotted circle in this case, and this is conformant behaviour in the case
of a defective combining sequence.
Is this correct, anyone, or am I overstating my case? Actually ZWJ is
theoretically less of a problem because it does specify a ligature between
the preceding and following combining character sequences. But ZWNJ
specifies that they should be rendered separately.
Don't be too concerned about what happens in Word. There are known bugs
that affect Biblical Hebrew, and there are known problems with using
control characters in some circumstances. However, the issue regarding
insertion of ZWJ and ZWNJ between combining marks needs clarification: word
from Paul Nelson at MS is that it definitely should be possible to insert
these characters between combining marks and so to affect the relationship
of the marks on either side of the control character, i.e. not breaking the
combining of the mark(s) following the control character with the preceding
base character. This does, however, require glyph-space processing of the
control characters.
Post by Peter Kirk
Post by John Hudson
... A question remains, however: should medial meteg with hataf be the
default rendering of <hataf..., meteg>, or should such ligation require
<hataf..., ZWJ, meteg>? This is a rendering issue, but one which affects
encoding: if one set of fonts treats ligation as default and another set
doesn't, users will produce documents with conflicting encoding
conventions depending on the rendering of the fonts they are using (one
can even imagine a single document, set in multiple fonts, using
different character sequences to obtain the same rendering). Personally,
I favour having the medial meteg as default rendering for <hataf...,
meteg>, requiring <hataf..., ZWNJ, meteg> in order to obtain a left
meteg, because the medial meteg appears to be the most common positioning
in the manuscript tradition.
If we do use ZWJ/ZWNJ, and based on the principle in the standard (TUS 4.0
pp. 389-390) "These characters are not to be used in all cases where
ligatures or cursive connections are desired; instead, they are only for
overriding the normal behavior of the text", I would suggest that
<hataf, meteg> should
be rendered according to the font default which may vary (medial for a
font based on BHS, left meteg for a font based on an edition in which this
is the default); <hataf, ZWJ, meteg> should be used to prefer medial
despite the default (not sure if this is ever required); and <hataf,
ZWNJ, meteg> to inhibit medial when this must not be used (as in a few
cases in BHS).
Okay, so we have:

<hataf, meteg> = variable rendering depending on font
<hataf, ZWJ, meteg> = always medial ligated form
<hataf, ZWNJ, meteg> = always left meteg (post hataf)
<meteg, CGJ, hataf> = always right meteg (pre hataf)

I'm reasonably comfortable with that, but it suggests that authors and
editors producing electronic documents, e.g. for web publishing, should
always expressly encode their preference using ZWJ and ZWNJ, since they
can't always or reliably determine what font will be used to display the text.
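
The four sequences above can be written out with explicit code points. The following Python sketch is only illustrative: the function name hataf_with_meteg and the position labels are hypothetical, and the mapping follows the convention proposed in this thread, not anything fixed by the standard.

```python
# Hataf vowels: hataf segol, hataf patah, hataf qamats
HATAF_VOWELS = {"\u05B1", "\u05B2", "\u05B3"}
METEG = "\u05BD"  # HEBREW POINT METEG
CGJ = "\u034F"    # COMBINING GRAPHEME JOINER
ZWJ = "\u200D"    # ZERO WIDTH JOINER
ZWNJ = "\u200C"   # ZERO WIDTH NON-JOINER


def hataf_with_meteg(hataf, position="default"):
    """Build the mark sequence for a hataf vowel plus meteg.

    position is one of "default", "medial", "left", "right"
    (hypothetical labels for the four cases listed in the thread).
    """
    if hataf not in HATAF_VOWELS:
        raise ValueError("expected a hataf vowel (U+05B1..U+05B3)")
    sequences = {
        "default": hataf + METEG,       # rendering depends on the font
        "medial": hataf + ZWJ + METEG,  # request the ligated medial form
        "left": hataf + ZWNJ + METEG,   # inhibit ligation: meteg after hataf
        "right": METEG + CGJ + hataf,   # meteg before hataf
    }
    return sequences[position]


# Example: lamed with hataf patah and an explicitly medial meteg
word = "\u05DC" + hataf_with_meteg("\u05B2", "medial")
print([f"U+{ord(c):04X}" for c in word])
```

An author who cannot predict the display font would use "medial" or "left" explicitly and avoid "default".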
Post by Peter Kirk
Post by John Hudson
Post by Peter Kirk
left meteg (non-hataf vowel): <vowel, meteg> = <meteg, vowel>
right meteg: <meteg, CGJ, vowel>
medial meteg (hataf vowel): <vowel, meteg> = <meteg, vowel>
left meteg (hataf vowel): <vowel, CGJ, meteg>
left meteg (hataf vowel): <vowel, ZWNJ, meteg>
See the reasons above for not using this.
Post by John Hudson
Does this mean that we are agreed that the medial meteg rendering should
be normative?
I am not intending to say that. I want to say that it can be the default
for a particular font or perhaps a font level attribute. Other fonts might
have left meteg as the default with hatafs and no medial meteg glyphs; in
that case the CGJ or ZWNJ would be ignored. Or they might have left meteg
as the default but also have medial meteg glyphs, in which case a
different mechanism would be required to request use of the medial meteg,
perhaps with ZWJ.
left meteg (non-hataf vowel): <vowel, meteg> = <meteg, vowel>
right meteg: <meteg, CGJ, vowel>
font's default position of meteg (hataf vowel): <vowel, meteg> = <meteg,
vowel>
medial meteg (hataf vowel) (if supported by the font): TBD (<vowel, ZWJ,
meteg> ???)
left meteg (hataf vowel): <vowel, CGJ, meteg>
Post by John Hudson
...
Post by Peter Kirk
2.10 Extraordinary Points
The SII encoded only the upper extraordinary point, as 05C4 HEBREW MARK
UPPER DOT. A character for the lower dot could be added, although it
appears only a few times.
Agreed. Although this latter character is rare, it is in regular and
undisputed use in a widely used text, and so probably does need to be encoded.
I am content either to have the lower punctum encoded or to use a generic
combining mark (U+0323), although the latter raises issues for
multiscript fonts in applications that do not support writing
system-specific glyph substitution (currently all applications). ...
Presumably a font could be programmed to substitute a glyph based on
context, especially for a combining mark where it would be relatively
simple to determine that the base character is in the Hebrew block and so
the Hebrew glyph variant is required. No help of course if you want an
isolated diacritic or a Qere without Ketiv form.
Yes, this is possible, although the OpenType architecture is designed to
deal with exactly this kind of language-specific substitution without
needing to use glyph context, using the Language System tag and the
Localised Forms <locl> layout feature. So I'd consider the glyph context
approach to be a hack for apps that are not aware of Language System tags
or don't process <locl>.
Post by Peter Kirk
Post by John Hudson
... What I am most keen to have is a clear statement from the UTC
identifying 05C4 HEBREW MARK UPPER DOT as the upper punctum, as Jony
indicates was intended by SII, and specifying a codepoint for the Hebrew
number / masoretic note dot, which requires its own glyph and cannot be
harmonised with the upper punctum character. Again, this could mean a new
Hebrew block character or U+0307 could be used.
Note that until Jony's note on SII's intent, I had presumed U+05C4 to be
the number / masoretic note dot, because of the absence of a
corresponding lower mark to indicate that it was the upper punctum. Now I
would like a definitive ruling from the UTC, to avoid future confusion.
Agreed. Notes should be added to the code charts for U+05C4, e.g. "= upper
punctum extraordinarium", and for U+0307 e.g. "= Hebrew number dot", each
with pointers to the other.
A question for Ken Whistler, if he is still following this: since Jony has
indicated that SII intended U+05C4 for the upper punctum extraordinarium,
is this sufficient for the editors of the standard to make a clarification
in the text without a decision from the UTC? Even though this reverses my
own interpretation of this character, I'm most keen to see a speedy resolution.

John Hudson

Tiro Typeworks www.tiro.com
Vancouver, BC ***@tiro.com

The sight of James Cox from the BBC's World at One,
interviewing Robin Oakley, CNN's man in Europe,
surrounded by a scrum of furiously scribbling print
journalists will stand for some time as the apogee of
media cannibalism.
- Emma Brockes, at the EU summit



John Cowan
2003-08-25 03:59:05 UTC
Permalink
Post by Peter Kirk
I suggest that a small area, either in the PUA block or somewhere
else, be defined as an RTL PUA.
Good idea! Or would it be adequate to suggest that RLM be inserted
before each PUA character? Would that make them right-to-left?
No, it would not. RLM is basically an invisible character whose only
property is RTL-ness; it can influence the direction of a nearby neutral,
or set the base direction for a text that should be RTL but happens to
begin with a strong LTR character, but it cannot change the directionality
of an existing LTR character.

However, the desired effect can be achieved by preceding each PUA
character, or sequence thereof, with RLO (U+202E) and following the
character or sequence with PDF (U+202C). All characters between
RLO and PDF are treated as strong RTL characters.
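
As a rough illustration of the RLO ... PDF wrapping, here is a minimal Python sketch. The helper name force_rtl and the sample code points U+E000 and U+E001 are assumptions made for the example; what a given PUA code point means is a matter of private agreement.

```python
import unicodedata

RLO = "\u202E"  # RIGHT-TO-LEFT OVERRIDE
PDF = "\u202C"  # POP DIRECTIONAL FORMATTING


def force_rtl(run):
    """Wrap a run of characters so a bidi-aware renderer treats it as RTL."""
    return RLO + run + PDF


# Two arbitrary private-use code points standing in for unencoded Hebrew marks
pua_run = "\uE000\uE001"
wrapped = force_rtl(pua_run)

# Show the bidi class the character database reports for each character
for ch in wrapped:
    print(f"U+{ord(ch):04X}", unicodedata.bidirectional(ch))
```

The override only affects layout by software that implements the bidirectional algorithm; the stored order of the characters is unchanged.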
Post by Peter Kirk
Perhaps. The problem is that known mark-up languages have as far as I
know no mechanisms for handling requests for variant glyphs.
For special purposes such as this, it is reasonable for Biblical scholars
to use their own markup languages or extensions to existing ones.
It would also be reasonable to contact the Style WG of the WWW Consortium
to discuss the possibility of adding some or all of the desired features
to the rendering languages CSS and XSL-FO.
--
You are a child of the universe no less John Cowan
than the trees and all other acyclic http://www.reutershealth.com
graphs; you have a right to be here. http://www.ccil.org/~cowan
--DeXiderata by Sean McGrath ***@reutershealth.com


Peter Kirk
2003-08-25 10:39:09 UTC
Permalink
Post by John Cowan
Post by Peter Kirk
I suggest that a small area, either in the PUA block or somewhere
else, be defined as an RTL PUA.
Good idea! Or would it be adequate to suggest that RLM be inserted
before each PUA character? Would that make them right-to-left?
No, it would not. RLM is basically an invisible character whose only
property is RTL-ness; it can influence the direction of a nearby neutral,
or set the base direction for a text that should be RTL but happens to
begin with a strong LTR character, but it cannot change the directionality
of an existing LTR character.
However, the desired effect can be achieved by preceding each PUA
character, or sequence thereof, with RLO (U+202E) and following the
character or sequence with PDF (U+202C). All characters between
RLO and PDF are treated as strong RTL characters.
Thank you. So I guess this is the appropriate mechanism for simulating
an RTL PUA. Which is not to say that there shouldn't be a real,
non-simulated one.
Post by John Cowan
Post by Peter Kirk
Perhaps. The problem is that known mark-up languages have as far as I
know no mechanisms for handling requests for variant glyphs.
For special purposes such as this, it is reasonable for Biblical scholars
to use their own markup languages or extensions to existing ones.
It would also be reasonable to contact the Style WG of the WWW Consortium
to discuss the possibility of adding some or all of the desired features
to the rendering languages CSS and XSL-FO.
Maybe. But I think it would also be reasonable for this WG to refer the
matter back to Unicode on the basis that variant glyphs of this kind are
an issue for Unicode rather than for markup. And I would agree.
--
Peter Kirk
***@qaya.org (personal)
***@qaya.org (work)
http://www.qaya.org/



