Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add the ability to specify a list of candidate encodings when guessing encoding (#36951) #208550

Merged
merged 9 commits into from
Jun 5, 2024

Conversation

yutotnh
Copy link
Contributor

@yutotnh yutotnh commented Mar 25, 2024

Corresponds to #36951 in Feature request.

This request adds a new setting files.candidateGuessEncodings.

Setting this will narrow down the encodings guessed by the charset auto-guesser (jschardet).

The threshold is also lowered from 0.2 to 0 so that the encoding can be guessed as much as possible.

This is to ensure that the encoding is returned even when the confidence level is small, as in the following added text

  • src/vs/workbench/services/textfile/test/node/encoding/fixtures/some.shiftjis.1.txt

In case it cannot be determined, files.encoding is respected

@Tyriar Tyriar assigned bpasero and unassigned Tyriar Mar 25, 2024
@yutotnh yutotnh changed the title Allow to configure a list of encodings to use when guessing #36951 Mar 30, 2024
@bpasero bpasero added this to the June 2024 milestone Jun 5, 2024
@bpasero
Copy link
Member

bpasero commented Jun 5, 2024

@yutotnh thanks for this, I have made some changes on top of yours.

One thing I would like to discuss outside of this PR is your change of threshold from 0.2 to 0. I reverted it back to how it was because I felt it was not part of solving the issue #36951

Can you elaborate why you want to change the threshold and should that maybe be another new setting?

@bpasero bpasero enabled auto-merge (squash) June 5, 2024 09:38
@bpasero bpasero merged commit 3ab7948 into microsoft:main Jun 5, 2024
6 checks passed
@peminator
Copy link

peminator commented Jun 6, 2024

just installed insiders and have two bugs, one unrelated is trying to open files by rightclick now freezes file explorer window, menu never pops up, maybe some vscode insiders or something related preventing it get menu items list...

second, the guess DOES NOT WORK AT ALL for windows1250, tried add to list alone, add pair with also utf8 in candidate json list, reverse the order.. it always opens as utf8 - adding file for test, zipped to avoid modification on saving here
__textSample-windows1250.zip

the file always opens as utf, the detection of the windows1250 never passes. maybe lower the required score in jschardet if using this candidate list ?

@bpasero
Copy link
Member

bpasero commented Jun 6, 2024

@peminator I can see a change in jschardet for your file:

3.0.0

[
  { encoding: 'windows-1252', confidence: 0.95 },
  { encoding: 'ISO-8859-2', confidence: 0.5168639982162754 }
]

3.1.2

[
  { encoding: 'ISO-8859-2', confidence: 0.5168639982162754 },
  { encoding: 'windows-1252', confidence: 0.01 }
]

When in VS Code insiders I configure:

"files.candidateGuessEncodings": [
        "windows1252"
]

The file opens as Windows 1252 for me, can you try?

As for the change in jschardet, please report an issue to https://github.com/aadsm/jschardet

@peminator
Copy link

peminator commented Jun 6, 2024

bpasero seems as specific problem for windows1250 maybe...
u were right if i set candidate include 1252 it opes as it...

{
    "files.candidateGuessEncodings": [
        "windows1250", "windows1252"
    ],
    "files.autoGuessEncoding": true
}

this opens the file as 1252... but its not proper in it see here:
image

some characters ok, but some get the unknown symbol, and some get misinterpreted, see the last marked letter

  • should be ľ but is interpreted as ¾

i have entered the correct and expected windows1250 as first and it is skipped, wonder why,
bc thats the whole point, VsCode used to open if autogues various ISO8859-2 or Windows1252 sometimes.. but i really need to open as windows1250 if not valid utf-8

Windows-1250 is a Central European encoding, used in Slovak or Czech texts, also as default for older Windows before it went Utf8, so wonder how it gets skipped

@bpasero
Copy link
Member

bpasero commented Jun 6, 2024

@peminator I do not see windows1250 ever being detected with jschardet, so how would it be able to return it when you configure that setting?

@peminator
Copy link

peminator commented Jun 6, 2024

@aadsm any idea why the windows1250 not detected properly in this file ?
testFile 1250

@nfrance709
Copy link

nfrance709 commented Jun 6, 2024

bpasero seems as specific problem for windows1250 maybe... u were right if i set candidate include 1252 it opes as it...

{
    "files.candidateGuessEncodings": [
        "windows1250", "windows1252"
    ],
    "files.autoGuessEncoding": true
}

this opens the file as 1252... but its not proper in it see here: image

some characters ok, but some get the unknown symbol, and some get misinterpreted, see the last marked letter

  • should be ľ but is interpreted as ¾

i have entered the correct and expected windows1250 as first and it is skipped, wonder why, bc thats the whole point, VsCode used to open if autogues various ISO8859-2 or Windows1252 sometimes.. but i really need to open as windows1250 if not valid utf-8

Windows-1250 is a Central European encoding, used in Slovak or Czech texts, also as default for older Windows before it went Utf8, so wonder how it gets skipped

@peminator
When jschardet is unable to identify the charset, the value set in files.encoding is used and you probably have it set to utf8, I tested the latest Insider VS code build with the following settings and it does open your “ __textSample-windows1250.php” as windows1250

"files.autoGuessEncoding": true,
"files.encoding": "windows1250",
"files.candidateGuessEncodings": [
        "utf8",
        "windows1250"
    ]

The detection of utf8 seems more reliable then the other encodings, so when using utf8 and another encoding its best to have files.encoding set to the non utf8 encoding as that’s the fallback encoding that is used.

@peminator
Copy link

peminator commented Jun 6, 2024

yes nfrance709 that may seem right, then it opens up as expected in windows1250,
but with an UNWANTED SIDE-EFFECT of having new files saved as windows1250 by default,

i need exactly the oposite way, have all UTF except old files opened in windows1250 bc of two old projects I still have to manage (may store workkspace setting for it locally, but i also often open files directly on ftp server)

@peminator When jschardet is unable to identify the charset, the value set in files.encoding is used and you probably have it set to utf8, I tested the latest Insider VS code build with the following settings and it does open your “ __textSample-windows1250.php�� as windows1250

"files.autoGuessEncoding": true,
"files.encoding": "windows1250",
"files.candidateGuessEncodings": [
        "utf8",
        "windows1250"
    ]

The detection of utf8 seems more reliable then the other encodings, so when using utf8 and another encoding its best to have files.encoding set to the non utf8 encoding as that’s the fallback encoding that is used.

@nfrance709
Copy link

nfrance709 commented Jun 6, 2024

If you need to ability to have new files created as utf8 for example with the "files.encoding": "utf8" and also have a fallback for when jschardet can’t determine the encoding then I guess there would need to be one more option added such as "candidateGuessEncodingsFallback": “windows1250” or perhaps have the last value in the "files.candidateGuessEncodings" array be used as the fallback when jschardet can’t determine the encoding.

yes nfrance709 that may seem right, then it opens up as expected in windows1250, but with an UNWANTED SIDE-EFFECT of having new files saved as windows1250 by default,

i need exactly the oposite way, have all UTF except old files opened in windows1250 bc of two old projects I still have to manage (may store workkspace setting for it locally, but i also often open files directly on ftp server)

@peminator
Copy link

seems overcomplicated, al i now need would be to fix in jschardet to properly recognize also windows1250, i dont see any viable reason why it gets no probability listed in the result of guess, here above mentioned in #208550 (comment) ther is no mention of 1250... something fishy there

If you need to ability to have new files created as utf8 for example with the "files.encoding": "utf8" and also have a fallback for when jschardet can’t determine the encoding then I guess there would need to be one more option added such as "candidateGuessEncodingsFallback": “windows1250” or perhaps have the last value in the "files.candidateGuessEncodings" array be used as the fallback when jschardet can’t determine the encoding.

yes nfrance709 that may seem right, then it opens up as expected in windows1250, but with an UNWANTED SIDE-EFFECT of having new files saved as windows1250 by default,
i need exactly the oposite way, have all UTF except old files opened in windows1250 bc of two old projects I still have to manage (may store workkspace setting for it locally, but i also often open files directly on ftp server)

@formigoni
Copy link

I have found a particular situation in which Encoding Detection is not working properly.

Having VSCode closed, when I open a file with .sql extension that was encoded with Windows-1252 encoding, via Windows context menu (right click - Open with - VSCode Insiders) it is wrongly opened as UTF-8.
If I open the same file from inside VSCode Insiders its encoding is detected correctly.
If I open the same file via Windows context menu, but having VSCode Insiders previously opened, the encoding is detected correctly.
If I change the file extension to .txt the issue doesn't happen.
Here are my settings:

	"files.autoGuessEncoding": true,
	"files.candidateGuessEncodings": [
		"windows1252",
		"utf8"
	],

I'm attaching the file that I'm using to perform the tests. Since I can't upload a file with .sql extension, it needs to be renamed so that one can replicate the scenario.
test.txt

@peminator
Copy link

peminator commented Jul 18, 2024

♥ Perfect! ♥
Theres a nice setting for it too, im thrilled, sucesfully tested with a windows1250 central european text file, opening over WinSCP from FTP server.

Great work everyone adding to this, both vscode part, and also improving jschardet lib to be better in recognizing win1250.
With setting guess order for the 1250 first, and utf as default, i finally have, what needed for long zears. All new files default to utf, and any old files finally properly open as expected in windows1250 encoding

image

@vs-code-engineering vs-code-engineering bot locked and limited conversation to collaborators Jul 21, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
8 participants