mirror of
https://github.com/vim/vim.git
synced 2026-02-01 11:34:23 +01:00
Problem: blob2str() does not handle UTF-16 encoding
(Hirohito Higashi)
Solution: Refactor the code and fix remaining issues, see below
(Yasuhiro Matsumoto).
blob2str() function did not properly handle UTF-16/UCS-2/UTF-32/UCS-4
encodings with endianness suffixes (e.g., utf-16le, utf-16be, ucs-2le).
The encoding name was canonicalized too aggressively, losing the
endianness information needed by iconv.
This change include few fixes:
- Preserve the raw encoding name with endianness suffix for iconv calls
- Normalize encoding names properly: "ucs2be" → "ucs-2be", "utf16le" →
"utf-16le"
- For multi-byte encodings (UTF-16/32, UCS-2/4), convert the entire blob
first, then split by newlines
convert_string() cannot handle UTF-16 because it uses string_convert()
which expects NUL-terminated strings. UTF-16 contains 0x00 bytes within
characters (e.g., "H" = 0x48 0x00), causing premature termination.
Therefore, for UTF-16/32 encodings, the fix uses string_convert_ext()
with an explicit input length to convert the entire blob at once.
The code appends two NUL bytes (ga_append(&blob_ga, NUL) twice) because
UTF-16 requires a 2-byte NUL terminator (0x00 0x00), not a single-byte
NUL.
- src/strings.c: Add from_encoding_raw to preserve endianness, special
handling for UTF-16/32 and UCS-2/4
- src/mbyte.c: Fix convert_setup_ext() to use == ENC_UNICODE instead of
& ENC_UNICODE. The bitwise AND was incorrectly treating UTF-16/UCS-2
(which have ENC_UNICODE + ENC_2BYTE etc.) as UTF-8, causing iconv
setup to be skipped.
fixes: #19198
closes: #19246
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Signed-off-by: Yasuhiro Matsumoto <mattn.jp@gmail.com>
Signed-off-by: Christian Brabandt <cb@256bit.org>