Discussion:
Ambiguous hyphenation cases with
fantasai
2014-07-22 14:03:33 UTC
On 05/12/2014 12:43 AM, Håkan Save Hansson wrote:
> Hi fantasai,
>
> Regarding your answer to my second suggestion (if you are referring
> to James Clarks first answer):
>
> The problem is that the hyphenation system in itself can't decide how
> to change the spelling, without any "dictionary" functionality. It
> can't know if I meant "mat-tjuv" ("food thief" in Swedish) or "matt-tjuv"
> ("carpet thief") when I wrote "mat&shy;tjuv". So there has to be a way
> to tell the hyphenation system that.

Hm. I don't think I have a solution for that problem. :/ Currently you'd
just have to not hyphenate that word.

CCing Unicode, in case anyone there has a solution

Up-reference: http://lists.w3.org/Archives/Public/www-style/2014Feb/0739.html

~fantasai
Christoph Päper
2014-07-22 16:14:11 UTC
fantasai <***@inkedblade.net>:

>> The problem is that the hyphenation system in itself can't decide how
>> to change the spelling, without any "dictionary" functionality. It
>> can't know if I meant "mat-tjuv" ("food thief" in Swedish) or "matt-tjuv"
>> ("carpet thief") when I wrote "mat&shy;tjuv". So there has to be a way
>> to tell the hyphenation system that.

Imagine if there was also ‘matt·juv’ next to ‘mat·tjuv’ and ‘matt·tjuv’, or even ‘mat·ttjuv’.

> Hm. I don't think I have a solution for that problem. :/
> Currently you'd just have to not hyphenate that word.

Smart-font solution (OpenType, AFDKO syntax):

“mattjuv, matttjuv”

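lookup tripleletters {
    # ttt -> tt: the first two of three consecutive t's become one
    sub t' t' t by t;
} tripleletters;

feature rlig {
    script latn;
    language SVE exclude_dflt;  # SVE is the OpenType language tag for Swedish
    lookup tripleletters;
} rlig;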

Combining Grapheme Joiner (U+034F, ‘CGJ’) could possibly be given an interpretation like this (XML syntax), but Zero-Width Non-Joiner (U+200C, ‘ZWNJ’) should probably not be repurposed:

“mattjuv, mat&#x34F;tjuv”

Possible Unicode solution with a new combining character that makes the preceding character or grapheme – I’m not sure which – invisible except at the end of a line:

“mattjuv, matt&#x2065;tjuv”

U+2065 – Combining Collapse or Reduplicating Soft Hyphen or so

All solutions require author education. The latter two require changes to existing software and specifications (including CSS); the first “just” requires updated fonts. The second solution would fall back gracefully to ‘mattjuv’, the others to ‘matttjuv’, maybe even with a .notdef glyph in there.

All of these approaches are too complicated for Joe Sixpack (or Jo Sexpack), so I don’t think they will work in practice, except in environments that already take care of edge cases such as distinguishing the umlaut and diaeresis uses of trema dots.

JFTR, Swedish is not the only language with this orthographic feature. The German orthography reform of 1996 did away with letter collapsing completely, probably because of this very problem. Now the same letter can appear three times in a row on the same line, which some consider ugly, but smart fonts can overcome most of the perceived problems by ligating the first two letters of such a sequence or by selecting an alternate glyph for the final one. The special treatment of the double-‘k’ grapheme was also abolished: it used to look like ‘ck’ (often a ligature) except at the end of a line, where it showed its real face, ‘k-k’; now it is always typed, encoded and displayed as ‘ck’ and cannot be separated. The theoretical graphemes ‘zz’ and ‘hh’ still look like ‘tz’ and ‘ch’ respectively, of which only the former may be split, as ‘t-z’.
Zack Weinberg
2014-07-22 18:16:41 UTC
On Tue, Jul 22, 2014 at 12:14 PM, Christoph Päper
<***@crissov.de> wrote:
> fantasai <***@inkedblade.net>:
>
>>> The problem is that the hyphenation system in itself can't decide how
>>> to change the spelling, without any "dictionary" functionality. It
>>> can't know if I meant "mat-tjuv" ("food thief" in Swedish) or "matt-tjuv"
>>> ("carpet thief") when I wrote "mat&shy;tjuv". So there has to be a way
>>> to tell the hyphenation system that.
...
> “mattjuv, mat&#x34F;tjuv”
>
> Possible Unicode solution with a new combining character that makes the preceding character or grapheme – I’m not sure which – invisible except at the end of a line:
>
> “mattjuv, matt&#x2065;tjuv”
>
> U+2065 – Combining Collapse or Reduplicating Soft Hyphen or so

I think I'd prefer new tags to new magic entities. In TeX this would be

mat\discretionary{t-}{}{}tjuv

so maybe in HTML

mat<dbr before="t-">tjuv

also accepting after= and nobreak= attributes. It's verbose but it's
easier to remember, I think.
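
To make the three-part semantics concrete, here is a rough Python model of a TeX-style discretionary. It is only a sketch: `Discretionary` and `render` are names invented for this illustration and are not part of TeX, HTML or CSS, and a real line breaker would decide per break point rather than all at once.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Discretionary:
    """Hypothetical model of a TeX-style discretionary break."""
    prebreak: str = ""   # emitted at the end of the line if the break is taken
    postbreak: str = ""  # emitted at the start of the next line if taken
    nobreak: str = ""    # emitted in place if the break is not taken


def render(parts, take_break):
    """Join plain strings and Discretionary items.

    For simplicity every discretionary in `parts` is either taken
    (a newline is inserted) or not, as selected by `take_break`.
    """
    out = []
    for part in parts:
        if isinstance(part, Discretionary):
            if take_break:
                out.append(part.prebreak + "\n" + part.postbreak)
            else:
                out.append(part.nobreak)
        else:
            out.append(part)
    return "".join(out)


# "carpet thief": an extra t appears only when the line is broken here
carpet_thief = ["mat", Discretionary(prebreak="t-"), "tjuv"]
print(render(carpet_thief, take_break=False))
print(render(carpet_thief, take_break=True))
```

The unbroken call yields the contracted spelling, and the broken call restores the third t before the hyphen.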

I'd also support a "hyphenation" CSS property with the same semantics
as TeX's \hyphenation{}, i.e.

hyphenation: "un-break-able" "mom-ent";

overrides the built-in hyphenation dictionary for the words
"unbreakable" and "moment" (within the selected elements; normally one
would put this on <body>).
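
As a sketch of how such an exception list might be consumed (assuming the values are plain words with "-" marking every permitted break; `parse_exceptions` and `break_points` are hypothetical helper names, not an existing API):

```python
def parse_exceptions(patterns):
    """Map each unhyphenated, lower-cased word to its allowed break offsets."""
    table = {}
    for pattern in patterns:
        pieces = pattern.split("-")
        offsets = []
        position = 0
        # Every piece except the last one ends at a permitted break point.
        for piece in pieces[:-1]:
            position += len(piece)
            offsets.append(position)
        table["".join(pieces).lower()] = offsets
    return table


def break_points(word, table):
    """Return the listed offsets, or None when the word has no exception
    (the caller would then fall back to the built-in dictionary)."""
    return table.get(word.lower())


exceptions = parse_exceptions(["un-break-able", "mom-ent"])
print(exceptions)
print(break_points("Moment", exceptions))
```

Words that are not listed return `None`, which is where the built-in hyphenation dictionary would take over.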

For bonus points,

hyphenation: "mat[t-//]tjuv"

precise syntax to be bikeshedded.
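
Purely as an illustration of one way the bracketed form could be read, the sketch below assumes the grammar `before[prebreak/postbreak/nobreak]after` with at most one bracket per word. That grammar is a guess based on the example above, and `parse_entry` is a hypothetical name.

```python
import re

# Guessed grammar: before[prebreak/postbreak/nobreak]after
ENTRY = re.compile(r"([^\[\]]*)\[([^/\]]*)/([^/\]]*)/([^/\]]*)\]([^\[\]]*)")


def parse_entry(entry):
    """Split one bracketed entry into its unbroken and broken spellings."""
    match = ENTRY.fullmatch(entry)
    if match is None:
        raise ValueError(f"not a bracketed hyphenation entry: {entry!r}")
    before, prebreak, postbreak, nobreak, after = match.groups()
    return {
        "unbroken": before + nobreak + after,    # spelling when no break is taken
        "line_end": before + prebreak,           # text ending the first line
        "line_start": postbreak + after,         # text starting the next line
    }


print(parse_entry("mat[t-//]tjuv"))
```

Under the same assumed grammar, the traditional German ck case would be written `Zu[k-//c]ker`.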

> All solutions require author education.

Yah.

zw
Kess Vargavind
2014-07-22 21:17:00 UTC
There actually is one simple solution that I sometimes use: do not contract
three consecutive same-letter consonants at all! That is, do like Icelandic
and write food thief as <mattjuv> and carpet thief as <matttjuv>. Then
there is no trouble hyphenating.

Yes, this goes against current spelling rules in Swedish, but it works. And
until there is better hyphenation support for corner cases like this
(either at character level or higher) that is how I have ‘solved’ it when
unable to do manual tweaking.

Would it be logical to add a character similar to U+00AD SOFT HYPHEN (shy)
that says “you can break me here, but unless you do please skip the
previous character (however such would be defined in a case like this)”?
Such that <matt[SHY-LIKE-CHAR]tjuv> is either rendered <mattjuv> or broken
up as <matt-tjuv>.
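
A small Python sketch of what I mean, under two assumptions: a private-use code point stands in for the character (nothing like it is encoded), and “previous character” is taken to be a single code point. The function names are made up for the example.

```python
MARK = "\uE000"  # private-use stand-in for the hypothetical SHY-like character


def render_unbroken(text):
    """No break taken: drop each marker together with the character before it."""
    out = []
    for ch in text:
        if ch == MARK:
            if out:
                out.pop()  # skip the character preceding the marker
        else:
            out.append(ch)
    return "".join(out)


def render_broken(text):
    """Break at the first marker: keep the preceding character and add a
    hyphen and a line break; any later markers stay unbroken."""
    head, sep, tail = text.partition(MARK)
    if not sep:
        return text  # no marker, nothing to break
    return head + "-\n" + render_unbroken(tail)


word = "matt" + MARK + "tjuv"
print(render_unbroken(word))
print(render_broken(word))
```

Software that does not know the character would presumably show all three t's, much like the uncontracted spelling above.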

Kess


Jörg Knappen
2014-07-25 11:35:13 UTC
With TeX and LaTeX there is an elegant solution.

TeX has the primitive \discretionary{prebreak}{postbreak}{nobreak}, which is written as

\discretionary{t-}{}{}

to insert an additional t at the hyphenation point. It also handles cases like the traditional German hyphenation of ck as k-k, with

\discretionary{k-}{}{c}

placed before the k.

The Babel system (inspired by german.sty) includes nifty shorthands like "t and "c for these cases.

The semantics of U+00AD (SOFT HYPHEN) are too primitive to implement this kind of behaviour; the same is true for &shy; in HTML.

--Jörg Knappen
Brad Kemper
2014-07-26 19:18:18 UTC
On Jul 22, 2014, at 2:17 PM, Kess Vargavind <***@gmail.com> wrote:
>
> That is, do like Icelandic and write food thief as <mattjuv> and carpet thief as <matttjuv>. Then there is no trouble hyphenating.

How about if anyone who steals a carpet in Sweden is just forced to eat it as punishment? Then, carpet=food. Problem solved!