Peter Kirk
2003-08-23 22:13:43 UTC
Dear colleagues,
I have posted at http://www.qsm.co.il/Hebrew/Hebrew%20Issues.htm a draft of
my summary of the several issues concerning Hebrew that had been discussed
lately.
Regards,
Jony
Thank you, Jony. This seems to me a fair overview of the issues. I
comment below on specific points, interleaved in a snipped plain text
copy of your draft.
Hebrew Issues - DRAFT
23, 2003.
1. Background
Recently, the Unicode list has been active with discussions of
problems relating to Hebrew in general and to Biblical Hebrew in
particular.
I suggest that before any solutions are devised or any changes to
Unicode proposed, a comprehensive list of all the issues should be
prepared. This is my draft.
In the following text, the word /Bible/ refers to the Hebrew book that
is also known as the Old Testament. The term /marks/ includes
cantillation marks and other Hebrew marks.
Perhaps it should be clarified here that Unicode distinguishes by their
names three types of combining character in the Hebrew block:
1. POINTS - vowel points, dagesh, meteg, rafe, shin and sin dots.
2. ACCENTS - this is the Unicode name for cantillation marks.
3. MARKS - masora circle, upper dot.
I assume by "marks" you refer to groups 2 and 3 as you refer separately
to "points".
...
1.2 Manuscripts
A manuscript, by its very nature, is different from a printed book.
The scribe draws by hand each letter and mark. Being human, he
sometimes makes mistakes, he sometimes has his own preferences, his
pen sometimes slips, and generally the outcome is significantly less
uniform than a printed book.
Biblical scholars are endeavoring to encode such ancient manuscripts,
with all their variances, with great precision. This occasionally
demands precise control over the location of the marks, beyond that
which may be achieved by Unicode and beyond the scope of plain text
encoding. While I appreciate and support the efforts of Biblical
scholars to achieve an electronic replication of manuscripts, I
believe that some of the issues that have been raised should be
resolved by higher level protocols such as mark-up.
I think here we need to distinguish between systematic and accidental
variances. I agree that accidental variances should not be encoded in
Unicode, and that systematic ones where there is contextual consistency
should be handled by fonts etc rather than Unicode. I have been looking
primarily at differences found not in manuscripts but in scholarly
printed editions, in which the accidental variability of manuscripts has
been eliminated and the systematic distinctions have been represented
according to a consensus of respected textual scholars. I have been
concerned to identify the systematic distinctions which these scholars
have considered significant enough to distinguish in the printed text,
and to suggest means of encoding and rendering these systematic
distinctions in Unicode. See my document referred to below.
1.3 Positioning of Hebrew Points and Marks
...
The first two paragraphs are correct, and it is a pity they were not
left alone. I don't think it is the business of Unicode to specify
these complex typographic rules. But since we started with it, we have
to address a number of exceptions.
Agreed. Or we could omit this section from the next version of the
standard and move them and expand them into a technical report or note,
where the exceptions can be addressed in detail.
2. The Issues
2.1 Vav Holam
...
The result is an interchange incompatibility problem. This is a plain
text issue, and should be addressed by the UTC.
Agreed.
2.2 Holam Alef
A related problem has been raised concerning the Holam Haser followed
by the letter Alef. Often, the Holam point is printed above the right
hand side of the Alef. It is shifted from the top left of the
preceding (to the right) letter to the top right of the Alef as a
typographical convention. This is normally done when the Alef is not
pronounced.
Although the rules concerning this case are fairly straightforward,
the rendering engine should not need to know so much grammar.
I'm a little surprised, Jony, that you came to this conclusion. It seems
to me that this one is a rendering issue. You have argued before that in
most typesetting this shift is not made. It has been demonstrated (in
Ezra SIL and SBL Hebrew with Uniscribe) that it is feasible for a
rendering engine to implement these rules, in the cases where this shift
is required for high quality e.g. biblical publications. The biblical
text already contains sufficient information to guide the rendering
engine, except possibly for a few special cases, and in the spirit of
"thou shalt not add thereto" I prefer not to do so when, as here, it is
not absolutely necessary.
A possible solution is to use ZWJ to indicate the shifting of the
Holam forward. For example, Bet Dagesh Holam ZWJ Alef.
Agreed, if a mechanism is required. My preference is to use this
encoding only for special cases where the shift takes place as an
exception to the regular rules, and to use ZWNJ instead of ZWJ to
inhibit such shifting in cases where it is not required.
By the way, your example is not in canonical order (although it is in
logical order, see my comments on 2.8 below), and will be reordered to
<bet, holam, dagesh, ZWJ, alef>.
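The reordering can be checked with a short sketch. It assumes Python's standard unicodedata module and the code points U+05D1 (bet), U+05BC (dagesh), U+05B9 (holam), U+200D (ZWJ) and U+05D0 (alef); the variable names are illustrative only.

```python
import unicodedata

# Code points assumed for the example in 2.2.
BET, ALEF = "\u05D1", "\u05D0"
DAGESH, HOLAM = "\u05BC", "\u05B9"
ZWJ = "\u200D"

# Logical (typing) order: Bet, Dagesh, Holam, ZWJ, Alef.
typed = BET + DAGESH + HOLAM + ZWJ + ALEF

# Canonical ordering sorts adjacent combining marks by combining class.
normalized = unicodedata.normalize("NFD", typed)

for label, text in (("typed", typed), ("normalized", normalized)):
    print(label)
    for ch in text:
        # Show each character with its canonical combining class.
        print(f"  U+{ord(ch):04X} ccc={unicodedata.combining(ch):>2} {unicodedata.name(ch)}")
```

Because holam has a lower combining class than dagesh, normalization moves it in front of the dagesh, while ZWJ (class 0) stays where it was typed.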
2.3 Grammar Books
In grammar books and other texts discussing the Hebrew script there
may arise a need to render various marks in isolation, without a
visible base character.
I understand that Unicode does provide a solution, as this problem is
not unique to Hebrew. However, since the suggested invisible base
character is not an RTL character, it has neutral directionality, and
an RLM may be needed.
Agreed. Use of RLM has the extra advantage of inhibiting unwanted
contraction of multiple spaces by higher level protocols.
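As a rough illustration of the directionality point, the sketch below looks up bidirectional classes with Python's standard unicodedata module. It assumes that the invisible base character in question is SPACE or NO-BREAK SPACE, and uses patah (U+05B7) as an arbitrary mark; neither choice comes from the draft.

```python
import unicodedata

SPACE, NBSP = "\u0020", "\u00A0"   # assumed candidates for the base character
RLM = "\u200F"                     # RIGHT-TO-LEFT MARK
ALEF, PATAH = "\u05D0", "\u05B7"   # a Hebrew letter and an arbitrary point

# Map each character name to its Unicode bidirectional class.
bidi_classes = {
    unicodedata.name(ch): unicodedata.bidirectional(ch)
    for ch in (ALEF, SPACE, NBSP, RLM, PATAH)
}
for name, bidi in bidi_classes.items():
    print(f"{bidi:>3}  {name}")

# One possible encoding of a patah shown in isolation: RLM, NBSP, patah.
isolated_patah = RLM + NBSP + PATAH
```

The letter and RLM report the strong class R, the two spaces do not, and the point itself is a non-spacing mark that takes its direction from its base.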
2.4 Private Use Area
The private use area characters, which are not defined by Unicode in
any other way, are defined to have left-to-right directionality. This
prevents their use in Hebrew and Arabic.
I suggest that a small area, either in the PUA block or somewhere
else, be defined as an RTL PUA.
Good idea! Or would it be adequate to suggest that RLM be inserted
before each PUA character? Would that make them right-to-left?
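The property behind this issue can be inspected with a small sketch, assuming Python's standard unicodedata module and taking U+E000 as an arbitrary sample from the Private Use Area. It reports only the character properties; whether a preceding RLM would be enough depends on how the bidirectional algorithm treats strong characters, which this lookup does not show.

```python
import unicodedata

PUA_SAMPLE = "\uE000"   # arbitrary private use code point, chosen for illustration
RLM = "\u200F"          # RIGHT-TO-LEFT MARK

# General category and bidirectional class of the private use character.
pua_category = unicodedata.category(PUA_SAMPLE)
pua_bidi = unicodedata.bidirectional(PUA_SAMPLE)

# Bidirectional class of RLM, for comparison.
rlm_bidi = unicodedata.bidirectional(RLM)

print(f"U+E000: category={pua_category} bidi={pua_bidi}")
print(f"RLM:    bidi={rlm_bidi}")
```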
2.5 Qere and Ketiv, Yerushala(y)im
...
In general, mark-up should be used to provide two alternative texts. I
don't believe it is possible or reasonable to computerize all the
possibilities that are afforded the scribe when he manually places the
points and marks of the Qere on a shorter Ketiv.
I think this is reasonable. At least Unicode fonts should not be
expected to render such things correctly. But I can see that some will
want to try to encode the mixed text form as it appears on the page. One
way to do so would be to use the sequence <RLM, NBSP> as a base
character around which the Qere points and marks can be arranged. (RLM
is necessary here to ensure correct directionality.) But Unicode should
not expect to guarantee correct rendering. And there is no need to
specify this in the standard.
For simpler cases, such as Yerushala(y)im, a zero width invisible base
character could be used. Various possibilities had been discussed. CGJ
is not appropriate because it is not a base character. ZWNBSP would
have been suitable, except that it has been taken over by the BOM.
I fail to see a good reason not to use CGJ in such a case. The Unicode
distinction between a base character and a combining character is a
technical one which does not need to align perfectly with every user's
perceptions.
The exceptional case in Exodus 20:4 of two points under one base
character where there is no omitted letter can also be dealt with well
using CGJ.
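A minimal sketch of the CGJ idea for Yerushala(y)im follows, assuming Python's standard unicodedata module. The sequence lamed, patah, CGJ, hiriq, final mem is only one plausible encoding of the word ending, chosen here for illustration.

```python
import unicodedata

CGJ = "\u034F"                          # COMBINING GRAPHEME JOINER
LAMED, FINAL_MEM = "\u05DC", "\u05DD"
PATAH, HIRIQ = "\u05B7", "\u05B4"

# Technically CGJ is a combining mark (not a base) with combining class 0.
cgj_category = unicodedata.category(CGJ)
cgj_class = unicodedata.combining(CGJ)

# Two vowels on one base: without CGJ, normalization swaps them,
# because hiriq has a lower combining class than patah.
without_cgj = LAMED + PATAH + HIRIQ + FINAL_MEM
with_cgj = LAMED + PATAH + CGJ + HIRIQ + FINAL_MEM

reordered = unicodedata.normalize("NFD", without_cgj)
preserved = unicodedata.normalize("NFD", with_cgj)

print(cgj_category, cgj_class)
print(reordered == without_cgj, preserved == with_cgj)
```

With CGJ between the two points, the typed order survives normalization; without it, the hiriq is moved in front of the patah.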
2.6 Furtive Patah
In many cases, a Patah vowel under a final Het, Alef or He is
pronounced before them, and this is indicated in fine printing by a
slight shift of the Patah to the right.
Since the rules to distinguish the Furtive are simple and
straightforward, i.e. this is a straightforward case of rendering, it
was decided at the SII that a special character is not needed.
Agreed. This is a rendering issue.
2.7 Meteg and Siluq
Unicode, following the SII, has unified the Meteg and the Siluq
because they look the same and are easy to distinguish, as Siluq
always appears before a Sof Pasuq.
The standard position of both the Meteg and the Siluq is to the left
of the vowel. In some cases the Meteg is written on the right hand side
of the vowel. With Hataf vowels, some printers place the Meteg in the
middle of the Hataf.
Not just printers, this appears in MSS as well.
In some editions, the Meteg on the right indicates it was added by the
editor and does not appear in the manuscript.
But in other cases it does appear in the manuscript. BHS, the standard
scholarly edition in western countries, follows the Leningrad codex in
meteg positioning. See for example the attached from this codex, Genesis
8:6 (taken from http://www.moses.uklinux.net/sample/b19a-preview.pdf).
There are also several right metegs visible in the extract from a Lisbon
codex of 1492 at http://www.moses.uklinux.net/sample/lisbon-sample.pdf,
in the repeated "and there was evening and there was morning" in Genesis
1 - interestingly, more of them than there are in BHS, but not all
metegs are to the right.
The medial Meteg in the Hataf vowels could be a rendering issue, a
combining marks ligature. However, in this case we would need a CGNJ
when a left Meteg is needed together with a Hataf.
In the absence of a CGNJ, and since CGJ does not have defined joining
properties despite its misleading name, I have suggested using CGJ for this.
For the right Meteg, a new character is needed. Whether it should be
in the PUA or a general use Unicode is open. A private convention by
the editor of a single book, however important, indicates the PUA. If
other uses are common, then it could be a Unicode character.
This is not a matter of a single book. I have identified three Bible
editions (BHS, BHK, and Baer as reported by GKC (Gesenius, Kautzsch,
Cowley) 16g) and two manuscripts which use right meteg as a distinctive
positioning. Anyway, I would have concerns about the principle "A
private convention by the editor of a single book, however important,
indicates the PUA" in a case where electronic editions of this book are
expected to be used and quoted by a worldwide community of thousands and
extensively on the Internet, in domains where interchange of PUA
characters has not been agreed.
But I disagree that a new character is needed. This is essentially an
alternative positioning of the same combining character relative to
other combining characters with which it interferes typographically.
This should have been dealt with by appropriate allocation of combining
classes. As it was not, the appropriate mechanism seems to be to use CGJ
to inhibit canonical reordering. Thus my suggestion (= indicates
canonical equivalence):
left meteg (non-hataf vowel): <vowel, meteg> = <meteg, vowel>
right meteg: <meteg, CGJ, vowel>
medial meteg (hataf vowel): <vowel, meteg> = <meteg, vowel>
left meteg (hataf vowel): <vowel, CGJ, meteg>
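The four encodings listed above can be tried out with a short sketch. It assumes Python's standard unicodedata module; the helper encode_meteg and its position labels are hypothetical names invented for this illustration, and alef is an arbitrary base letter.

```python
import unicodedata

METEG, CGJ = "\u05BD", "\u034F"
PATAH, HATAF_PATAH = "\u05B7", "\u05B2"
ALEF = "\u05D0"  # arbitrary base letter for the demonstration


def nfd(text):
    """Return the canonical decomposition (with canonical ordering) of text."""
    return unicodedata.normalize("NFD", text)


def encode_meteg(vowel, position):
    """Hypothetical helper following the four-way scheme suggested above."""
    if position == "right":
        return METEG + CGJ + vowel
    if position == "left-hataf":
        return vowel + CGJ + METEG
    # Plain left meteg (non-hataf vowel) and medial meteg (hataf vowel).
    return vowel + METEG


# <vowel, meteg> and <meteg, vowel> are canonically equivalent.
plain_equivalent = nfd(ALEF + METEG + PATAH) == nfd(ALEF + PATAH + METEG)

# With CGJ in between, the typed order is kept by normalization.
right = ALEF + encode_meteg(PATAH, "right")
left_hataf = ALEF + encode_meteg(HATAF_PATAH, "left-hataf")
medial_hataf = ALEF + encode_meteg(HATAF_PATAH, "medial")

print(plain_equivalent, nfd(right) == right, nfd(left_hataf) == left_hataf)
```

The design relies only on CGJ having combining class 0, so that canonical reordering cannot move marks across it; how a font then positions the meteg is a separate rendering question.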
2.8 Combining Classes
When a Hebrew text is normalized according to Unicode normalization
rules, the combining marks are not ordered according to the
convenience of some rendering engines.
It has been stated, however, that this is not the purpose of the
combining classes, and that the rendering engine should, in this case,
reorder the combining marks according to its preferences as part of
the rendering process.
Agreed, but this is only part of the story. Different combining classes
have been assigned to points which do interfere typographically, and
this is causing several problems. Also the canonical ordering is
illogical e.g. consonant modifiers (sin/shin dot, dagesh, rafe) are in
canonical order separated from the consonants they modify by the vowels
which logically follow; it is not the order used instinctively in typing
or by Jony in writing the example in 2.2 above. This causes problems
with collation which can only be fixed by defining hundreds of contractions.
But I understand that it is not possible to fix the errors which were
originally made in defining these classes.
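The separation of consonant modifiers from their consonant can be seen in a small sketch, assuming Python's standard unicodedata module. The sample sequence (shin, shin dot, dagesh, qamats) is chosen here only for illustration.

```python
import unicodedata

SHIN = "\u05E9"
SHIN_DOT, DAGESH, QAMATS = "\u05C1", "\u05BC", "\u05B8"

# Typing order: the consonant, then its modifiers, then the vowel.
typed = SHIN + SHIN_DOT + DAGESH + QAMATS

# Canonical order sorts the marks by combining class, vowel first.
canonical = unicodedata.normalize("NFD", typed)

# Combining classes of the three marks, keyed by character name.
classes = {unicodedata.name(ch): unicodedata.combining(ch) for ch in typed[1:]}

print(classes)
print([unicodedata.name(ch) for ch in canonical])
```

In canonical order the vowel comes directly after the shin, and the dagesh and shin dot that modify the consonant follow it.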
2.9 Inverted Nun
In the Bible there are a few cases of a special mark known as
"Inverted Nun". It is probably not an inverted letter Nun, and
requires its own character, HEBREW MARK INVERTED NUN.
Agreed.
2.10 Extraordinary Points
The SII encoded only the upper extraordinary point, as 05C4 HEBREW
MARK UPPER DOT. A character for the lower dot could be added, although
it appears only a few times.
Agreed. Although this latter character is rare, it is in regular and
undisputed use in a widely used text, and so probably does need to be
encoded.
2.11 Broken Letters
There are in the text of the Bible a few instances of the mutilated or
broken letters Vav and Qof. I suggest this could be handled by mark-up.
Perhaps. The problem is that known mark-up languages have as far as I
know no mechanisms for handling requests for variant glyphs. But Unicode
does have such a mechanism, variation selectors. This could be a case
where it would be suitable to use them.
2.12 Number Dots
An old practice was to use dots and double dots above to distinguish
"non words", such as numbers and acronyms. For several centuries this
usage has been replaced by the use of Geresh and Gershayim.
The dots always appear on unpointed texts. There is nothing special
about them, so the existing Unicodes 0307 and 0308 could be used.
Agreed.
2.13 Shva Na vs. Shva Nah
The Hebrew vowel Shva has two meanings, known as Shva Na and Shva Nah.
Some printers desire to make the difference visible.
This is analogous to similar issues in other languages, for example
the dual meaning of s in the English word summers, and should be
handled by mark-up.
It seems to me that this is more analogous to the diacritics added to
English words in some dictionaries etc to indicate and disambiguate
their pronunciation, which can be encoded in Unicode. And again this is
not something which any known mark-up can handle. So, at least if this
is at all a regular practice and the glyphs used are at all
standardised, a good case can be made for encoding a second separate
combining character here, or possibly using a variation selector. If it
is not at all standardised, "A private convention by the editor of a
single book ... indicates the PUA."
2.14 Qamats Gadol vs. Qamats Qatan
The Hebrew vowel Qamats has two meanings, known as Qamats Gadol and
Qamats Qatan. Some printers desire to make the difference visible.
This is analogous to similar issues in other languages, for example
the dual meaning of s in the English word summers, and should be
handled by mark-up.
Same comment as on 2.13.
2.15 Vav with Dagesh vs. Shuruq
The Hebrew vowel Shuruq looks exactly like a Vav with Dagesh. Unicode,
following the SII, unified them.
Some people want to see a code for the Vav Shuruq, considering it a
separate vowel. Since there is no known typographical difference I see
no reason to do so.
I agree. But according to GKC p.55 note 2 there should actually be a
typographical difference: "/Wāw/ with /Dageš/ (וּ) cannot in our printed
texts be distinguished from /wāw/ pointed as /Šûrĕq/ (וּ); in the latter
case the point should stand higher up."
2.16 Hiriq Male
A vowel Hiriq followed by a silent Yod is called Hiriq Male.
Some people want to see a code for Hiriq Male, considering it a
separate vowel. Since there is no known typographical difference I see
no reason to do so.
Agreed.
3. References
Issues in the Representation of Pointed Hebrew in Unicode, Second
draft, Peter Kirk, August 2003,
http://www.qaya.org/academic/hebrew/Issues-Hebrew-Unicode.doc
http://www.qaya.org/academic/hebrew/Issues-Hebrew-Unicode.html.
I intend to make some minor updates to this, which I will post at the
same location and perhaps also as a PDF.
...
You have not mentioned the following issue which I identified:
2.6. Punctuation issues
Certain Hebrew punctuation marks are not correctly described in
Unicode 4.0.
/Sof pasuq/ is used to indicate the end of a verse in the Hebrew Bible
(although it is missing from the end of a few verses in some texts,
and completely absent from some others) and as the equivalent of a
full stop in other Hebrew writings such as prayer books. It should be
classed and processed as Terminal_Punctuation and also as a character
which typically terminates a sentence.
/Paseq/ is also used only at the ends of words, and so should also be
classed as Terminal_Punctuation, but not as terminating a sentence.
/Paseq/ has two uses, one as part of the Hebrew accent system and the
other as a special textual mark in the Hebrew Bible; it is normally
found only in the Hebrew Bible and in quotations from it.
/Maqaf/ is also generally considered to be a word divider and so
should also be classed as Terminal_Punctuation. As its usage is
analogous to that of /hyphen/ and line breaks commonly occur after it
in pointed Hebrew texts, it should also be listed in Unicode Standard
Annex #14, along with /hyphen/, as a “break opportunity after”.
I hope the above will help you in revising and completing your draft
document.
--
Peter Kirk
***@qaya.org (personal)
***@qaya.org (work)
http://www.qaya.org/