Closed Bug 862292 Opened 11 years ago Closed 4 years ago

Use UTF-8 for all outgoing email

Categories

(MailNews Core :: Composition, task, P2)

Tracking

(thunderbird_esr78 wontfix)

RESOLVED FIXED
82 Branch
Tracking Status
thunderbird_esr78 --- wontfix

People

(Reporter: hsivonen, Assigned: rnons)

References

Details

(Keywords: intl)

Attachments

(4 files, 9 obsolete files)

15.34 KB, patch
mkmelin
: review+
Details | Diff | Splinter Review
114.32 KB, patch
rnons
: review+
Details | Diff | Splinter Review
1.07 KB, patch
mkmelin
: review+
Details | Diff | Splinter Review
2.76 KB, patch
rnons
: review+
Details | Diff | Splinter Review
Dealing with character encodings as a lot of complexity to code. Moreover, there is even more complexity arising from MailNews supporting more character encodings than what are healthy to support for the Web. Meanwhile, UTF-8 can represent everything that is possible to write in e-mail, was invented a bit over 20 years ago and has been broadly supported for at least 10 years or so.

Considering that virtually everyone has been able to receive UTF-8-encoded e-mail for years by now, I suggest making MailNews Core always send e-mail as UTF-8 so that the code for managing the outgoing encoding can be removed and, therefore, complexity, including user interface complexity, be reduced.
That's technically a duplicate of bug 224391 but sure worth revisiting after a substantial time has passed. Looking at bug 410333, there have been issues with specific locales, thus the default certainly will still have to be configurable by individual localizations. While mailnews.send_default_charset is defined in mailnews.js, it refers to messenger.properties which is local to each application (mail/ vs. suite/), so that's an application-specific decision.
Keywords: intl
tl;dr - Defaulting to UTF-8 is probably safe now; removing support for emitting non-UTF-8 might not be.

(In reply to Henri Sivonen (:hsivonen) from comment #0)
> Dealing with character encodings as a lot of complexity to code. Moreover,
> there is even more complexity arising from MailNews supporting more
> character encodings than what are healthy to support for the Web. Meanwhile,
> UTF-8 can represent everything that is possible to write in e-mail, was
> invented a bit over 20 years ago and has been broadly supported for at least
> 10 years or so.
> 
> Considering that virtually everyone has been able to receive UTF-8-encoded
> e-mail for years by now, I suggest making MailNews Core always send e-mail
> as UTF-8 so that the code for managing the outgoing encoding can be removed
> and, therefore, complexity, including user interface complexity, be reduced.

At this point, I'm looking at making RFC 2047 encoded words always use UTF-8, largely because the JS TextEncoder service I'm using only supports it, but also because it's painful enough to get the algorithm right with UTF-8; any other multibyte charset is annoying.

(In reply to rsx11m from comment #1)
> That's technically a duplicate of bug 224391 but sure worth revisiting after
> a substantial time has passed. Looking at bug 410333, there have been issues
> with specific locales, thus the default certainly will still have to be
> configurable by individual localizations. While
> mailnews.send_default_charset is defined in mailnews.js, it refers to
> messenger.properties which is local to each application (mail/ vs. suite/),
> so that's an application-specific decision.

I recall UTF-8 versus non-UTF-8 coming up before in mdat or tb-planning. As of 2008, UTF-8 support in Japan was insufficient enough to cause bug 448842 to be filed. In a conversation where I brought that bug up a few years later, it was suggested that this is no longer a major issue.

Rereading bug 224391, it seems the other area of possible contention is use of UTF-8 in Usenet. Citing http://www.crampe.eu.org/statfr/tout.html:
  36.58% :	Mozilla
  19.23% :	MesNews
  10.23% :	Microsoft Outlook Express
  9.47% :	G2
  5.84% :	MacSOUP
  2.68% :	Microsoft Windows Live Mail
  2.62% :	Pan
  1.72% :	Microsoft Windows Mail
  1.61% :	[40tude Dialog]
  1.31% :	slrn
  1.25% :	bleachbot
  1.23% :	Forte Agent 
  6.20% :	[everybody else]

Mozilla, G2, Microsoft * all support UTF-8 unless their architecture is dumb. MesNews has some mumbles in their documentation about HTML, so I suspect they support UTF-8 as well. MacSOUP's documentation lists that it added support for UTF-8 in 2.5, which appears to date to about the same time as OS X 10.1 (!). Pan appears to use UTF-8 internally; slrn and Forte Agent both appear to support UTF-8 as well. I can't find any documentation to suggest whether or not UTF-8 is supported in 40tude, and bleachbot is just a posting bot, so it doesn't count.

To sum it up: I'm willing to believe that there are no major barriers to defaulting to UTF-8. Removing the machinery that allows encoding to legacy charsets is probably not tenable at this point in time. I would suggest, if we do this, that we keep the ability to quickly undo the change until it has fully baked on the ESR branches.
The ability to support other encodings certainly should be retained. The user may want an explicit encoding different from UTF-8 or the default could be different for a specific locale based on the localizer's judgment. Thus, only changing mailnews.send_default_charset seems to be the way to go. Removing the messenger.properties dependency should work as long as localizers can override in all-l10n.js.

The other question is whether or not to change the handling of mailnews.view_default_charset in the process. It may be a bit more tricky here, especially given that a message may not have any charset definition at all, at which point defaulting to UTF-8 rather than a local encoding may be the wrong thing to do (most likely that's for different bug though, but should be considered to maintain parity between sending and receiving ends).
(In reply to Joshua Cranmer [:jcranmer] from comment #2)
> tl;dr - Defaulting to UTF-8 is probably safe now;

I guess even that would be progress.

> removing support for emitting non-UTF-8 might not be.

:-( Then we wouldn't be able to actually simplify code.

(In reply to rsx11m from comment #3)
> The
> user may want an explicit encoding different from UTF-8 

"May want to" is a terrible reason. This should be based on compat data.

> or the default could be different for a specific locale based on the localizer's judgment.

Again, things should be based on compat data instead of localizers second-guessing data.

> but
> should be considered to maintain parity between sending and receiving ends).

Why is parity needed there? I'm suggesting always sending out UTF-8 no matter what sort of message is being replied to. If the recipient could receive a fresh non-reply UTF-8 message, they can deal with UTF-8 replies, too.
My understanding is that many non-English locales treat UTF-encoded messages as inherently more likely to be spam than their language-specific encoding. This info may be a few years out of date, but if we're going to go down this road, we should be really sure that it's no longer the case.
(In reply to Henri Sivonen (:hsivonen) from comment #4)
> (In reply to Joshua Cranmer [:jcranmer] from comment #2)
> > tl;dr - Defaulting to UTF-8 is probably safe now;
> 
> I guess even that would be progress.
> 
> > removing support for emitting non-UTF-8 might not be.
> 
> :-( Then we wouldn't be able to actually simplify code.

The most complicated code with respect to charset encoding is preparing RFC 2047 encoded words, which I am already moving to be UTF-8-only. For everything else, the code basically looks like this:

let binaryData = charset-encode(charset, text);
[ If you get an exception, fall back to UTF-8 ]
let asciiData = qpEncode | base64Encode | percentEncode (binaryData);

So unless you're planning on ripping out all of our non-UTF-8 charset encoders, there isn't a whole lot of code-simplification going on.

> > but
> > should be considered to maintain parity between sending and receiving ends).
> 
> Why is parity needed there? I'm suggesting always sending out UTF-8 no
> matter what sort of message is being replied to. If the recipient could
> receive a fresh non-reply UTF-8 message, they can deal with UTF-8 replies,
> too.

A. I don't 100% trust libmime to work properly if we start using different charsets when including original message data (go figure)
B. There is more at concern here than just email--Usenet has slightly different conventions. When the message is clearly in the full thralls of MIME (i.e., it uses quoted-printable or base64 for message data), then there is almost no concern. If the message is using 8-bit non-UTF-8 body parts, there is a small chance that downstream people might be expecting that charset in particular and not UTF-8.

This is all speculative, though, and I definitely want real-world usage feedback before we decide to irrevocably pull the plug on making non-UTF-8 messages.
As an example of a preference other than UTF-8 in some locales: the GB18030 charset can encode any Unicode codepoint, but it is biased in favour of Chinese (while UTF-8 is biased in favour of ASCII and Latin); also, it must (by law) be available on every computer hardware & software sold in mainland China. If Thunderbird cannot send email in GB18030, it might quite well get banned from distribution in China.

I agree with jcranmer (comment #2): “Defaulting to UTF-8 is probably safe now; removing support for emitting non-UTF-8 might not be.”
(In reply to Tony Mechelynck [:tonymec] from comment #7)
> If Thunderbird cannot send email in GB18030, it might quite well get
> banned from distribution in China.

I think we shouldn't make decisions based on speculation about what regulation actually says. This should be checked with legal when we'd otherwise be ready to make output UTF-8-only. Specifically, [citation needed] for PRC regulation requiring GB18030 encoding support for software *output* (as opposed to supporting the particular set of characters however encoded or supporting the GB18030 encoding for input).
Google gives a bunch of matches, do you consider IANA authoritative enough?

(Quoting http://www.iana.org/assignments/charset-reg/GB18030)
>    GB18030 is a "mandatory" standard: starting September 1, 2001, all
>    operating systems sold in Mainland China must support this
>    standard.  (Embedded systems and PDAs are currently exempt.)
>    Eventually, end-user applications must also fully support the
>    GB18030 standard--mere UTF-8 support is not enough. [...]

Now you can argue what "support" means, but in the context of e-mail and news messages, I don't see why they would require it only on one end (receiving yes, sending no). Given that it's their answer to UTF-8 (and essentially their national version thereof), that interpretation wouldn't make much sense.

(In reply to Henri Sivonen (:hsivonen) from comment #4)
> > or the default could be different for a specific locale based on the localizer's judgment.
> 
> Again, things should be based on compat data instead of localizers
> second-guessing data.

The case discussed right now is part of the reason why localizers should be involved in this discussion. They should have a better understanding about culture and technical or legal issues that "compat data" may not register.
(In reply to Henri Sivonen (:hsivonen) from comment #8)
> I think we shouldn't make decisions based on speculation about what
> regulation actually says. This should be checked with legal when we'd
> otherwise be ready to make output UTF-8-only.

That is rather upside-down. First you have to figure out what are the factors are that need to be considered in a decision whether or not all locales can be forced to emit UTF-8 only, then you can go ahead and figure out which parts and how to remove supporting code. And, I absolutely don't see a need to rush any decision in this regard. It would be a major step that needs to be carefully considered in its consequences.
There's actually a Usenet thread going on right now that uses a lot of Unicode characters in place of regular ASCII text (for example:

Ḭ իԲνҿ ท
FWIW, the Russian Localization of Thunderbird defaults to UTF-8 for outgoing email already.
(In reply to Joshua Cranmer [:jcranmer] from comment #2)
> As
> of 2008, UTF-8 support in Japan was insufficient enough to cause bug 448842
> to be filed.

Gmail has eliminated their UTF-8 avoidance pref. They now send all outgoing email--even when the user uses Japanese-localized Gmail in a Japanese-localized browser from an IP address assigned to Japan and uses only ISO-2022-JP-encodable characters. This has been the case since at least May 2015.

Stuff like bug 1202401 keeps coming up. I think the Thunderbird developers' time (and mine when changing things on the Gecko side) would be better allocated to other things--as it could be if all the code dealing with UTF-8 avoidance in the sending side of mailnews was gone once and for all.
So in bug 998191, I landed a change that forced all RFC 2047 encoding to use UTF-8 [1], and bug 1324443 made the RFC 2231 encoding use it as well. In searching for bugs, I see no evidence that anyone has complained about this change or noticed it (although I haven't exactly been exhaustive in thinking up synonyms for "garbled subject on outgoing messages" for uninformed bug filers).

This strongly suggests that there is no software in any of our domains that cannot support UTF-8 at a technical level. That said, the question is still unanswered as to whether or not the policy configurations permit UTF-8. I'm uncomfortable making this sort of change without testing it first--and the easiest way to test is to make the change in the first place.

I'm open to testing this, so long as we have a plan in place for backing the change out quickly if the answer turns out to be "no, we can't do this." Discovering the answer "yes" would require a few months baking in the public release before I could breath the sigh of relief.

[1] I made this decision largely on the basis that emitting RFC 2047 correctly (i.e., getting the encoded-word breaks correct) is difficult enough as it is. Requiring UTF-8 means that I can take advantage of its self-synchronizing nature to avoid breaking in the middle of a character (which the old encoder was capable of doing, since it worked by deleting characters one by one until it fit).
(In reply to Joshua Cranmer [:jcranmer] from comment #15)
> This strongly suggests that there is no software in any of our domains that
> cannot support UTF-8 at a technical level.

Email gateways in Japan have now had two years of Gmail not sending ISO-2022-JP to deal with, so hopefully the gateways have been forced to deal by now (if they hadn't been already two years ago).

> That said, the question is still
> unanswered as to whether or not the policy configurations permit UTF-8.

Is this in reference to GB18030? As far as I can tell as a person not residing in China and not reading Chinese, the use of GB18030-the-encoding in protocol output is not of interest and the issue that matters is that software can handle the Unicode ranges that are applicable to languages used in China and are designated as such by GB18030-the-standard. (Gecko is able to.)

Note that UTF-only formats like JSON seem to be doing fine in this respect.
(In reply to Henri Sivonen (:hsivonen) from comment #16)
> (In reply to Joshua Cranmer [:jcranmer] from comment #15)
> > That said, the question is still
> > unanswered as to whether or not the policy configurations permit UTF-8.
> 
> Is this in reference to GB18030? As far as I can tell as a person not
> residing in China and not reading Chinese, the use of GB18030-the-encoding
> in protocol output is not of interest and the issue that matters is that
> software can handle the Unicode ranges that are applicable to languages used
> in China and are designated as such by GB18030-the-standard. (Gecko is able
> to.)

No, it's a reference to idiots who might do things like "mark all UTF-8 mail as spam" or NNTP servers that reject all UTF-8 posts. As I said, the best approach here is to just flip the switch while being prepared to unflip it if people start complaining.
I didn't notice sometimes we violate RFC. We should not do this until we fix bug 1427636.
Depends on: 1427636
No longer depends on: 1427636

We should probably do this now. Even Japanese set the mailnews.send_default_charset to UTF-8 since back in 2017. https://dxr.mozilla.org/l10n-central/source/ja/mail/chrome/messenger/messenger.properties#281
If they can do it, and Gmail has done it since 4-5 years already, it doesn't seem likely it would cause any problems.

Priority: -- → P2

Let's take this on now, and save ourselves from some pointless charset complications.

Assignee: nobody → remotenonsense

Can you please post to the relevant mailing lists announcing this change.

Attached patch 862292.patch (obsolete) — Splinter Review

This patch

  • removes the Text Encoding menu from compose window
  • removes mailnews.send_default_charset preference

Do we want to remove mailnews.reply_in_default_charset as well?

(In reply to Jorg K (CEST = GMT+2) from comment #21)

Can you please post to the relevant mailing lists announcing this change.

OK, I will post to the maildev list after my patch got reviewed, so that I know I understand the problem correctly.

Attachment #9169090 - Flags: review?(mkmelin+mozilla)
Attachment #9169090 - Flags: review?(Pidgeot18)
Status: NEW → ASSIGNED

Do we want to remove mailnews.reply_in_default_charset as well?

It seems to me that pref needs to be removed (in the sense of always replying in UTF-8) in order to end up it a state where all outgoing email is in UTF-8 and the non-UTF-8 email generation code paths become irrelevant.

Yes that should be removed as well.

Comment on attachment 9169090 [details] [diff] [review]
862292.patch

Review of attachment 9169090 [details] [diff] [review]:
-----------------------------------------------------------------

I don't think it's worth touching suite/ (except possible string changes). None of the is remotely in testable shape anyway...

::: mail/components/compose/content/MsgComposeCommands.js
@@ +4193,5 @@
>   * Return the full display string for any non-default text encoding of the
>   * current composition (friendly name plus official character set name).
>   * For the default text encoding, return empty string (""), to reduce
>   * ux-complexity, e.g. for the default Status Bar display.
> + * Note: The default is UTF-8.

We don't need this comment anymore. I didn't look a lot but couldn't the whole comFields.characterSet be removed?

::: mail/components/compose/content/messengercompose.xhtml
@@ +1728,5 @@
>            <menuitem id="dsnMenu" type="checkbox" label="&dsnMenu.label;" accesskey="&dsnMenu.accesskey;" oncommand="ToggleDSN(event.target)"/>
>            <menuseparator/>
>            <menu id="charsetMenu"
>                  data-l10n-id="menu-view-charset"
> +                hidden="true"

the whole menu should go

::: mailnews/compose/src/nsMsgCompose.cpp
@@ +1523,5 @@
>    if (NS_FAILED(rv)) {
>      nsCOMPtr<nsIPrefBranch> prefs(
>          do_GetService(NS_PREFSERVICE_CONTRACTID, &rv));
>      NS_ENSURE_SUCCESS(rv, rv);
>      nsString defaultCharset;

defaultCharset(u"UTF-8"_ns);

.... would be nicer. But I guess this will all go away since there is no defaultCharset
Attachment #9169090 - Flags: review?(mkmelin+mozilla)
Attachment #9169090 - Flags: review?(Pidgeot18)
Attached patch 862292.patch (obsolete) — Splinter Review
  • Remove mailnews.reply_in_default_charset and mailnews.disable_fallback_to_utf8
  • Remove characterSet from nsIMsgCompFields
  • Fix tests

Try run: https://treeherder.mozilla.org/#/jobs?repo=try-comm-central&revision=3a5a7916a054ec9d1d1096c98168254fcaf87cbc

Attachment #9169090 - Attachment is obsolete: true
Attachment #9169550 - Flags: review?(mkmelin+mozilla)
Attached patch 862292.patch (obsolete) — Splinter Review
Attachment #9169550 - Attachment is obsolete: true
Attachment #9169550 - Flags: review?(mkmelin+mozilla)
Attachment #9169551 - Flags: review?(mkmelin+mozilla)
Comment on attachment 9169551 [details] [diff] [review]
862292.patch

Review of attachment 9169551 [details] [diff] [review]:
-----------------------------------------------------------------

I'm surprised that https://searchfox.org/comm-central/source/mail/test/browser/composition/browser_charsetEdit.js doesn't need to change.

Also, keep in mind that the outgoing delsp processing was only introduced for ISO-2022-JP, so if you scrap that, the delsp processing can be removed, for example here:
https://searchfox.org/comm-central/rev/51b7d1ea0a96a8c4d52812ac3e27360b2da4b91c/mailnews/compose/src/nsMsgCompUtils.cpp#1491
See bug 653342 comment #209.

f- since this appears to be incomplete. Also, maybe a good idea to split the patch into parts: Like (just fantasising): 1) UI changes 2) backend changes 3) remove delsp 4) test changes, or any other split which may be appropriate.

::: mailnews/compose/test/unit/test_longLines.js
@@ +180,5 @@
>      },
>      longMultibyteLineCJK + " " + newline + newline // Expected body: The message without the tags.
>    );
>  
>    // Now a special test for ISO-2022-JP.

I think this comment should go as well.
Attachment #9169551 - Flags: feedback-

I don't think it's very useful to split up backend, UI and test changes. We could split up removal of delsp though.

Comment on attachment 9169551 [details] [diff] [review]
862292.patch

Review of attachment 9169551 [details] [diff] [review]:
-----------------------------------------------------------------

::: mailnews/compose/test/unit/test_autoReply.js
@@ +215,5 @@
>    // The text gets converted to UTF-8 (regardless of what it is) at some point.
>    // Suspect a bug with how BinaryInputStream handles the strings.
>    if (templateHdr.Charset == "windows-1252") {
>      // XXX: should really check for "��� åäö"
> +    if (!body.includes("åäö xlatin1")) {

I don't think this should change?
Old templates/drafts can have outdated charsets. We'll have to deal with that
Attachment #9169551 - Flags: review?(mkmelin+mozilla)

::: mailnews/compose/test/unit/test_autoReply.js
@@ +215,5 @@

// The text gets converted to UTF-8 (regardless of what it is) at some point.
// Suspect a bug with how BinaryInputStream handles the strings.
if (templateHdr.Charset == "windows-1252") {
// XXX: should really check for " åäö"

  • if (!body.includes("åäö xlatin1")) {

I don't think this should change?
Old templates/drafts can have outdated charsets. We'll have to deal with that

Yes, it's a bit confusing, but the change is needed. Take https://searchfox.org/comm-central/source/mailnews/test/data/template-utf8#26 for example, åäö xutf8 became "åäö xutf8" in the test file https://searchfox.org/comm-central/source/mailnews/compose/test/unit/test_autoReply.js#230. Maybe due to https://searchfox.org/comm-central/source/mailnews/compose/test/unit/test_autoReply.js#214-216

I can split up if that makes reviewing easier. But if you don't mind, I will not split then, and will remove delsp in a separate patch.

I'm surprised that https://searchfox.org/comm-central/source/mail/test/browser/composition/browser_charsetEdit.js doesn't need to change.

I thought it was irrelevant so I removed it, seems I'm wrong, will update.

Sorry, I missed the removal. Maybe it is irrelevant now, I'd have to check closer. What happens with you receive a message in windows-1252 or ISO-2022-JP and you go "edit as new message". That needs to be successfully converted into UTF-8 now. That case is a little similar to that test, no?

(In reply to Jorg K (CEST = GMT+2) from comment #28)

Also, keep in mind that the outgoing delsp processing was only introduced
for ISO-2022-JP, so if you scrap that, the delsp processing can be removed,
for example here:

I was going to say that's not entirely true, but then I went back searching through the bugs and remembered that we have this giant cesspool of interaction between format=flowed, plaintext serializer, charset, and Content-Transfer-Encoding selection that is overcomplicated and way too deferential to supporting legacy issues that may not matter anymore, and so what ought to be the design in a smart modern system has too little implications for our existing codebase. When we can watch all of this crap code go up in a merry blaze, I will be quite happy.

I haven't looked at this patch in detail yet (splinter review is crapping out on me), but I will ask that any time you assign "UTF-8" to a nsString only to immediately pass it into a UTF-16-to-ASCII (or UTF-8) conversion function, just strip out the middleman and assign it to a nsCString instead. Better yet, try to go forward and pass "UTF-8" literal strings to directly wherever the results are being used.

Attached patch 862292.patch (obsolete) — Splinter Review
  • Update browser_charsetEdit.js
  • Fix some comments and pass "UTF-8" directly in a few places
Attachment #9169551 - Attachment is obsolete: true
Attachment #9169732 - Flags: review?(mkmelin+mozilla)
Attached patch 862292-delsp.patch (obsolete) — Splinter Review

Remove delsp related code.

Attachment #9169733 - Flags: review?(mkmelin+mozilla)

Comment on attachment 9169733 [details] [diff] [review]
862292-delsp.patch

You missed https://searchfox.org/comm-central/rev/b3767e49a6acf1f66abd0a276bd71579eada4787/mailnews/compose/src/MessageSend.jsm#359 (that's new, I know, but Searchfox finds it). Actually, there is more in that file:
https://searchfox.org/comm-central/search?q=DelSp&path=&case=false&regexp=false

Not that the stuff in mimetpfl.cpp is for incoming delsp, so that needs to stay.

Attachment #9169733 - Flags: review?(mkmelin+mozilla)

Thanks, if it's OK with you, I'm planning to update MessageSend.jsm related stuff in bug 1211292, after this bug is closed.

Comment on attachment 9169733 [details] [diff] [review]
862292-delsp.patch

Sure. Sorry.

Attachment #9169733 - Flags: review?(mkmelin+mozilla)

Remove unnecessary charset and disallowBreaks arguments from GetSerialiserFlags.

Attachment #9169733 - Attachment is obsolete: true
Attachment #9169733 - Flags: review?(mkmelin+mozilla)
Attachment #9169939 - Flags: review?(mkmelin+mozilla)

This change breaks Chinese, Japanese and Korean languages (CJK), and probably some others.

The issue is that Unicode is broken for CJK. Early on they made the mistake of trying to merge many characters in these languages that are in fact not the same, so it becomes impossible to create a true Unicode font that works for more than one of the three.

It's a severe issue because it prevents people writing their names correctly and other basic stuff like that. Worse still the receiver of the email may see a different character on their screen to the one the sender wrote.

So until Unicode gets fixed for CJK Thunderbird should default to their preferred national character encodings. I don't know if it would be easier to have an allowlist or denylist of broken Unicode languages because I'm not sure how widespread this problem is. I seem to recall some Thai people complaining they couldn't render their names in Unicode a few years back.

Thai uses UTF-8 already so I can't think that's a problem. Japanese also uses UTF-8.
zh-TW uses UTF-8 as well. I'm assuming zh-CN uses gbk from old habit or more or less political reasons?

Korean uses EUC-KR, I don't know if there's a reason. If there's a reason, what does Gmail do for Korean?

https://dxr.mozilla.org/l10n-central/search?q=mailnews.send_default_charset&redirect=false

(In reply to Magnus Melin [:mkmelin] from comment #41)

Thai uses UTF-8 already so I can't think that's a problem.

It's not a problem. Han unification is irrelevant to Thai. The way Unicode analyzes Thai is based on the Thai national standard that existed before. There's no benefit from using the legacy encoding.

Japanese also uses UTF-8.

Moreover, absent a hint of how to resolve CJK ambiguity, Gecko resolves in favor of Japanese by default. This makes sense for platforms that have distinct font designs for different CJK locales, since this way kana and kanji match design-wise.

zh-TW uses UTF-8 as well.

So there's no change from this patch for Traditional Chinese.

I'm assuming zh-CN uses gbk from old habit or more or less political reasons?

There are potential relevance to font selection for incoming email that was labeled as gbk (if the CJK resolution hint from encoding even survives in Thunderbird the way it survives in Firefox; I didn't check). However, if the result is unsatisfactory, users can change the font Thunderbird uses to a font that resolves the issue. In any case, zh-CN users have been receiving UTF-8 email from non-Thunderbird senders for ages and, therefore, must have dealt with this already.

Korean uses EUC-KR, I don't know if there's a reason. If there's a reason, what does Gmail do for Korean?

Han characters are rarely used in Korean these days. They mostly appear as abbreviations in newspaper headlines and in parenthetical clarifications. I don't have statistics for email, but it's indicative that the Android input method for Korean doesn't even support the input of Han characters. Hangul doesn't have anything to unify with C or J, so there are no unification issues.

This is incorrect, Japan still uses a lot of Shift-JIS because of the Unicode issues and I believe China is BIG-5, but I'm not an expert on Chinese.

Again though, the issue is less of a problem for say Japanese people who only communicate with other Japanese people in Japanese. Where it becomes an issue is when they need to talk to a Chinese or Korean person.

Imagine if you booked a flight from Xiamen to Tokyo but the ticket had your name printed wrong because of Unicode. Now you have to explain to the staff that it's a character encoding issue, your browser displays Chinese Unicode and the airline uses Japanese Unicode and please look at the Latin transliteration and ignore the fact that there is a mismatch with my passport.

I remembered wrong, it was Bengali that was the issue: https://modelviewculture.com/pieces/i-can-text-you-a-pile-of-poo-but-i-cant-write-my-name

That was 2015, I don't know if they have fixed it and all relevant fonts have been updated and all relevant software that processes and displays Unicode has been fixed.

I suggest that Thunderbird displays a notification to the user asking them if they want to make the change, and warning them that it could cause potential issues with certain languages (Chinese, Japanese, Korean, Bengali, possibly others).

(In reply to kuro68k from comment #43)

This is incorrect, Japan still uses a lot of Shift-JIS because of the Unicode issues and I believe China is BIG-5, but I'm not an expert on Chinese.

I request that you refrain from arguing about this further on this bug report. The above sentence is indicative that it is not going to be productive.

I remembered wrong, it was Bengali that was the issue: https://modelviewculture.com/pieces/i-can-text-you-a-pile-of-poo-but-i-cant-write-my-name

That article is unfortunate in its attribution of issues. Unicode's analysis of Bengali is based on ISCII, which the federal government of India issued in 1988. The approach of analyzing non-Devanagari Brahmic scripts used in India as isomorphisms of Devanagari comes from there.

In any case, Thunderbird doesn't support any non-Unicode encodings for Bengali. Also, Thunderbird processes all text as Unicode internally, so the notion that not using UTF-8 for interchange somehow avoided Unicode is mistaken.

That article is unfortunate in its attribution of issues.

Not really, it either works or it doesn't and Unicode doesn't. Naturally people who can't write their own names in it care little where the blame is assigned, only that it gets fixed.

Anyway I realized that this isn't an issue as long as Thunderbird adds the right RFC 3282 header. Well, it is an issue because at least outside of HTML email you can't mix CJK, but it addresses the main problem here.

(In reply to Henri Sivonen (:hsivonen) from comment #42)

Japanese also uses UTF-8.

Moreover, absent a hint of how to resolve CJK ambiguity, Gecko resolves in favor of Japanese by default. This makes sense for platforms that have distinct font designs for different CJK locales, since this way kana and kanji match design-wise.

It is no longer true. We changed the precedence recently (see bug 1581822) and made it configurable (bug 1596875).

(In reply to kuro68k from comment #43)

This is incorrect, Japan still uses a lot of Shift-JIS because of the Unicode issues and I believe China is BIG-5, but I'm not an expert on Chinese.

Did you forget that this bug is about text encoding for e-mails? We (Japanese) do not use Shift-JIS encoding for e-mail messages. We used to use ISO-2022-JP exclusively.

Comment on attachment 9169732 [details] [diff] [review]
862292.patch

Review of attachment 9169732 [details] [diff] [review]:
-----------------------------------------------------------------

::: mailnews/compose/src/nsMsgCompose.cpp
@@ +2840,5 @@
>        mCharsetOverride || mAnswerDefaultCharset, false, mHtmlToQuote);
>  
>    mQuoteStreamListener->SetComposeObj(this);
>  
> +  rv = mQuote->QuoteMessage(msgURI, false, mQuoteStreamListener, "UTF-8", false,

This change doesn't seem correct to me. You're dropping the dependence on mCharsetOverride here, and there is a distinction in the callee between an empty "" and an explicit charset here.

From a quick glance of code, I believe that this would cause a reply to any message whose body is not UTF-8 to have corrupted text in the reply.

An interesting test to add to conduct (and add to our test suite) is replying inline to a non-UTF-8 message. A further test would be to reply inline to a message whose charset has been manually overridden by the UI.

(In reply to Joshua Cranmer [:jcranmer] from comment #47)

An interesting test to add to conduct (and add to our test suite) is
replying inline to a non-UTF-8 message. A further test would be to reply
inline to a message whose charset has been manually overridden by the UI.

Right, and, as I said in comment #32, we also need to cover the other code path, which is forward, "edit as new message", edit draft/template. You could have an old non-UTF-8 draft/template. Remember that composition has two very distinct code paths: One is reply and one is the other lot.

EDIT: Looks like there are some forwarding tests here: https://searchfox.org/comm-central/source/mail/test/browser/composition/browser_forwardUTF8.js which can be extended to non-UTF-8 messages.
There is also https://searchfox.org/comm-central/source/mail/test/browser/composition/browser_replyMultipartCharset.js which uses body-greek.eml. So you'd have to check what is covered and what is not and extend what is already there.

Attached patch 862292-part1.patch (obsolete) — Splinter Review

Updates according to comment 47. Fixed QuoteMessage argument, added browser_quoteMessage.js.

An interesting test to add to conduct (and add to our test suite) is replying inline to a non-UTF-8 message. A further test would be to reply inline to a message whose charset has been manually overridden by the UI.

I've tested replying to non-UTF8 message a few times before, it worked well. After your comment, I became aware of the Options > Quote Message menu, it did corrupt non-UTF8 text, thanks. I've fixed argument of QuoteMessage and added browser_quoteMessage.js for this case. For the case of manually changing the view charset to an incorrect one, then click reply, the text is corrupted as well. But I think that is expected, at least it's the same on nightly, do I need to do anything?

(In reply to Jorg K (CEST = GMT+2) from comment #32)

Sorry, I missed the removal. Maybe it is irrelevant now, I'd have to check closer. What happens with you receive a message in windows-1252 or ISO-2022-JP and you go "edit as new message". That needs to be successfully converted into UTF-8 now. That case is a little similar to that test, no?

Sorry, I forgot to reply this message earlier, click "edit as new message" on a ISO-2022-JP message works well.

(In reply to Jorg K (CEST = GMT+2) from comment #48)

Right, and, as I said in comment #32, we also need to cover the other code path, which is forward, "edit as new message", edit draft/template. You could have an old non-UTF-8 draft/template. Remember that composition has two very distinct code paths: One is reply and one is the other lot.

Seems to me browser_replyMultipartCharset.js already covered reply, edit as new and forward. And test_autoReply covered non-UTF8 template.

The case I missed was Options > Quote Message, which is fixed in this patch.

Attachment #9169732 - Attachment is obsolete: true
Attachment #9169732 - Flags: review?(mkmelin+mozilla)
Attachment #9170300 - Flags: review?(mkmelin+mozilla)
Attachment #9170300 - Attachment is patch: true
Attachment #9169939 - Attachment description: 862292-delsp.patch → 862292-part2-delsp.patch
Comment on attachment 9170300 [details] [diff] [review]
862292-part1.patch

Review of attachment 9170300 [details] [diff] [review]:
-----------------------------------------------------------------

::: mail/components/preferences/fonts.js
@@ +195,5 @@
>      return mailFixedWidthMessages.checked;
>    },
>  
>    /**
> +   * mailnews.view_default_charset is nsIPrefLocalizedString. Its default value

As a side note, I think this and force_charset_override should go as well but let's do that in another bug.

::: mail/test/browser/composition/browser_quoteMessage.js
@@ +6,5 @@
> + * Tests that in the compose window, Options > Quote Message works well for
> + * non-UTF8 encoding.
> + */
> +
> +let {

We usually use var for the top level variables.

::: mail/test/browser/composition/browser_replyMultipartCharset.js
@@ +126,5 @@
>  add_task(async function test_replyEditAsNewForward_enforceDefault() {
>    // Check that the default is honoured (inspired by bug 1323377).
> +  await subtest_replyEditAsNewForward_charset(1, "./multipart-charset.eml");
> +  await subtest_replyEditAsNewForward_charset(2, "./multipart-charset.eml");
> +  await subtest_replyEditAsNewForward_charset(3, "./multipart-charset.eml");

Looks like this doesn't really do anything useful anymore.
test_replyEditAsNewForward_charsetFromBody already checks the same things

::: mailnews/compose/src/nsMsgCompose.cpp
@@ +1255,2 @@
>      nsCString outCString;
> +    rv = nsMsgI18NConvertFromUnicode("UTF-8"_ns, msgBody, outCString, true);

I guess this should just be NS_ConvertUTF16toUTF8 now

@@ +1488,1 @@
>    NS_WARNING_ASSERTION(NS_SUCCEEDED(rv), "SetCharset() failed");

This could be just a normal NS_ENSURE_SUCCESS(rv, rv); now
If we can't set charset to be UTF-8, something else is quite wrong

::: mailnews/compose/src/nsMsgSend.cpp
@@ +1357,5 @@
> +
> +  bool isAsciiOnly = mozilla::IsAsciiNullTerminated(
> +      static_cast<const char16_t*>(bodyText.get()));
> +  rv = nsMsgI18NConvertFromUnicode(nsDependentCString(aCharset), bodyText,
> +                                   outCString, true);

NS_ConvertUTF16toUTF8? here and below

@@ +1359,5 @@
> +      static_cast<const char16_t*>(bodyText.get()));
> +  rv = nsMsgI18NConvertFromUnicode(nsDependentCString(aCharset), bodyText,
> +                                   outCString, true);
> +  if (mCompFields->GetForceMsgEncoding()) isAsciiOnly = false;
> +  mCompFields->SetBodyIsAsciiOnly(isAsciiOnly);

Maybe the forceMsgEncoding and isAsciiOnly are not something we need anymore?
Attachment #9170300 - Flags: review?(mkmelin+mozilla) → feedback+
Comment on attachment 9169939 [details] [diff] [review]
862292-part2-delsp.patch

Review of attachment 9169939 [details] [diff] [review]:
-----------------------------------------------------------------

Seems reasonable to me
Attachment #9169939 - Flags: review?(mkmelin+mozilla) → review+
Attached file 862292-part1.patch (obsolete) —

Updates according to comment 50.

  • use var
  • use NS_ConvertUTF16toUTF8
  • remove unused test

We usually use var for the top level variables.

Seems to me const is used more in m-c now.

Maybe the forceMsgEncoding and isAsciiOnly are not something we need anymore?

I think isAsciiOnly can be removed, maybe we should do it separately. But forceMsgEncoding is used to determine QP. https://searchfox.org/comm-central/source/mailnews/compose/src/nsMsgAttachmentHandler.cpp#334

Attachment #9170300 - Attachment is obsolete: true
Attachment #9171322 - Flags: review?(mkmelin+mozilla)
Attached patch 862292-part1.patch (obsolete) — Splinter Review

eslint

Attachment #9171322 - Attachment is obsolete: true
Attachment #9171322 - Flags: review?(mkmelin+mozilla)
Attachment #9171326 - Flags: review?(mkmelin+mozilla)
Comment on attachment 9171326 [details] [diff] [review]
862292-part1.patch

Review of attachment 9171326 [details] [diff] [review]:
-----------------------------------------------------------------

r=mkmelin with the below

Let's land it only after the merge next Tuesday so it can ride the full trains.
Please reference a successful try run before checkin-needed-tb

::: mailnews/compose/src/nsMsgSend.cpp
@@ +1349,5 @@
>      }
>    }
>  
>    nsCString attachment1 [details] [diff] [review]_body;
> +  // Convert body to UTF-8

not really a conversion anymore, so we can drop the comment.
Maybe you don't need attachment1 [details] [diff] [review]_body anymore either

@@ +1363,5 @@
> +  // this we need to do the charset conversion on this part separately
> +  if (!origHTMLBody.IsEmpty()) {
> +    nsCString newBody;
> +    newBody.Assign(NS_ConvertUTF16toUTF8(origHTMLBody));
> +    mOriginalHTMLBody = ToNewCString(newBody);

no need for newBody, and conversion, right?

mOriginalHTMLBody = NS_ConvertUTF16toUTF8(origHTMLBody).get() ?

@@ +1365,5 @@
> +    nsCString newBody;
> +    newBody.Assign(NS_ConvertUTF16toUTF8(origHTMLBody));
> +    mOriginalHTMLBody = ToNewCString(newBody);
> +  } else {
> +    mOriginalHTMLBody = ToNewCString(attachment1 [details] [diff] [review]_body);

= NS_ConvertUTF16toUTF8(bodyText).get()?
Attachment #9171326 - Flags: review?(mkmelin+mozilla) → review+
Type: defect → task

(In reply to Magnus Melin [:mkmelin] from comment #54)

mOriginalHTMLBody = NS_ConvertUTF16toUTF8(origHTMLBody).get() ?

= NS_ConvertUTF16toUTF8(bodyText).get()?

Wouldn't using .get() result in use-after-free? What would keep the pointee of mOriginalHTMLBody alive in that case?

Why isn't mOriginalHTMLBody an nsCString in the first place?

I think you're right. Looks like we should just make mOriginalHTMLBody an nsCString.

Updates according to comment 54.

Maybe you don't need attachment1_body anymore either

I think it's still needed since there is rv = SnarfAndCopyBody(attachment1_body, TEXT_HTML); later.

(In reply to Magnus Melin [:mkmelin] from comment #56)

I think you're right. Looks like we should just make mOriginalHTMLBody an nsCString.

I didn't make this change. There are several other char* and ToNewCStrings in nsMsgSend, I don't see the immediate benefit of only changing mOriginalHTMLBody to nsCString in this patch.

Try run: https://treeherder.mozilla.org/#/jobs?repo=try-comm-central&revision=32a5ce888a39968972d2b1be17efef

Attachment #9171326 - Attachment is obsolete: true
Attachment #9171848 - Flags: review+
Target Milestone: --- → 82 Branch

Pushed by geoff@darktrojan.net:
https://hg.mozilla.org/comm-central/rev/58159743bab4
Use UTF-8 for all outgoing email. r=mkmelin
https://hg.mozilla.org/comm-central/rev/0d9bc3ca1452
Remove delsp since UTF-8 is always used. r=mkmelin

Status: ASSIGNED → RESOLVED
Closed: 4 years ago
Resolution: --- → FIXED

There's a new failure in comm/mailnews/compose/test/unit/test_longLines.js on Windows, which seems likely to be related to this bug. Please check it out.

Example failure log:
https://treeherder.mozilla.org/logviewer.html#?job_id=314049833&repo=comm-central&lineNumber=4043

(It's there in your Try run too. I didn't spot it before landing.)

Flags: needinfo?(remotenonsense)

Fix test failure in test_longLines.

There is also a failure in browser_quoteMessage.js related to open menu on macOS, looking into it.

Flags: needinfo?(remotenonsense)
Attachment #9172093 - Flags: review?(mkmelin+mozilla)
Attachment #9172093 - Flags: review?(mkmelin+mozilla) → review+

Pushed by geoff@darktrojan.net:
https://hg.mozilla.org/comm-central/rev/a6bcebe11fbc
followup - Fix newline character in test_longLines. r=mkmelin DONTBUILD

Call goDoCommand("cmd_quoteMessage") directly on macOS to fix browser_quoteMessage.
Try run: https://treeherder.mozilla.org/#/jobs?repo=try-comm-central&revision=08fae1a373dc2579e3e59dad3a3deac5549a56ab

Attachment #9172165 - Flags: review?(mkmelin+mozilla)
Comment on attachment 9172165 [details] [diff] [review]
862292-part4-browser_quoteMessage.patch

Review of attachment 9172165 [details] [diff] [review]:
-----------------------------------------------------------------

::: mail/test/browser/composition/browser_quoteMessage.js
@@ +64,5 @@
> +    cwc.click_menus_in_sequence(cwc.e("optionsMenuPopup"), [
> +      { id: "menu_quoteMessage" },
> +    ]);
> +  } else {
> +    // Native menubar is used on macOS, didn't find a way to click it.

For the 3pane we many times go through the hamburger menu instead. But that's not there in the compose window so I guess this might be the only way yes.
Attachment #9172165 - Flags: review?(mkmelin+mozilla) → review+

Just found browser_quoteMessage still failed for macOS on Try server, will take another look.

Status: RESOLVED → REOPENED
Resolution: FIXED → ---

Pushed by mkmelin@iki.fi:
https://hg.mozilla.org/comm-central/rev/e79e0ea42b97
followup - Workaround clicking menu on mac to make browser_quoteMessage pass. r=mkmelin

Status: REOPENED → RESOLVED
Closed: 4 years ago4 years ago
Resolution: --- → FIXED
You need to log in before you can comment on or make changes to this bug.