vim-mirror

mirror of https://github.com/vim/vim.git synced 2026-02-01 11:34:23 +01:00

Yasuhiro Matsumoto 2b184d4b97 patch 9.1.2124: blob2str() does not handle UTF-16 encoding

Problem:  blob2str() does not handle UTF-16 encoding
          (Hirohito Higashi)
Solution: Refactor the code and fix remaining issues, see below
          (Yasuhiro Matsumoto).

blob2str() did not properly handle UTF-16/UCS-2/UTF-32/UCS-4
encodings with endianness suffixes (e.g., utf-16le, utf-16be, ucs-2le).
The encoding name was canonicalized too aggressively, losing the
endianness information needed by iconv.

This change includes a few fixes:

- Preserve the raw encoding name with endianness suffix for iconv calls
- Normalize encoding names properly: "ucs2be" → "ucs-2be", "utf16le" →
  "utf-16le"
- For multi-byte encodings (UTF-16/32, UCS-2/4), convert the entire blob
  first, then split by newlines
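
The name normalization in the list above can be sketched as follows.
This is a minimal illustration, not the code from src/strings.c:
demo_normalize_enc() is a hypothetical helper, and it assumes the
input name is already lowercase.

```c
#include <assert.h>
#include <ctype.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper (not a Vim function): insert the missing '-'
 * after a "ucs"/"utf" prefix when a digit follows directly, and keep
 * any endianness suffix ("le"/"be") untouched.
 * Assumes "in" is lowercase; names that do not match are copied as-is.
 */
void demo_normalize_enc(const char *in, char *out, size_t outsz)
{
    assert(in != NULL && out != NULL && outsz > 0);

    if ((strncmp(in, "ucs", 3) == 0 || strncmp(in, "utf", 3) == 0)
	    && isdigit((unsigned char)in[3]))
	/* "ucs2be" -> "ucs" + "-" + "2be" */
	snprintf(out, outsz, "%.3s-%s", in, in + 3);
    else
	snprintf(out, outsz, "%s", in);
}
```

The suffix is carried through unchanged because that is the part
iconv needs in order to pick the byte order.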

convert_string() cannot handle UTF-16 because it uses string_convert(),
which expects NUL-terminated strings. UTF-16 contains 0x00 bytes within
characters (e.g., "H" is 0x48 0x00 in UTF-16LE), causing premature
termination. Therefore, for UTF-16/32 and UCS-2/4 encodings, the fix
uses string_convert_ext() with an explicit input length to convert the
entire blob at once.
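
To make the truncation concrete, here is a small standalone sketch. It
does not call any Vim function: len_nul_terminated() and
demo_hi_utf16le are hypothetical names, and strlen() stands in for any
API that assumes a single-byte NUL terminator.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* "Hi" encoded as UTF-16LE, followed by a 2-byte NUL terminator. */
const unsigned char demo_hi_utf16le[] = {
    0x48, 0x00,		/* 'H' */
    0x69, 0x00,		/* 'i' */
    0x00, 0x00		/* terminator */
};

/* The text itself occupies 4 bytes (two 16-bit code units). */
const size_t demo_hi_payload_len = 4;

/*
 * Length as seen by code that expects a NUL-terminated byte string:
 * counting stops at the first 0x00 byte, which here is the high byte
 * of 'H'.
 */
size_t len_nul_terminated(const unsigned char *buf)
{
    assert(buf != NULL);
    return strlen((const char *)buf);
}
```

The byte-string view reports a length of 1 while the payload is 4 bytes
long, which is why the length has to be passed explicitly.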

The code appends two NUL bytes (ga_append(&blob_ga, NUL) twice) because
UTF-16 requires a 2-byte NUL terminator (0x00 0x00), not a single-byte
NUL.
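
A short sketch of why one NUL byte is not enough. utf16_len_bytes() is
a hypothetical helper, not part of Vim; it assumes the buffer holds
whole 16-bit code units and ends with a 0x00 0x00 pair.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper: return the number of bytes before the 2-byte
 * NUL terminator. The scan advances one 16-bit code unit at a time, so
 * a lone 0x00 byte inside a character (e.g. 0x48 0x00) does not end
 * the string. Only an aligned 0x00 0x00 pair does.
 */
size_t utf16_len_bytes(const unsigned char *buf)
{
    size_t i = 0;

    assert(buf != NULL);
    while (buf[i] != 0 || buf[i + 1] != 0)
	i += 2;
    return i;
}

/* "Hi" in UTF-16LE with the 2-byte terminator appended. */
const unsigned char demo_terminated[] = {
    0x48, 0x00, 0x69, 0x00, 0x00, 0x00
};
```

If only a single 0x00 were appended, a scan like this would read one
byte past the end of the buffer while looking for the second half of
the terminator.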

- src/strings.c: Add from_encoding_raw to preserve endianness, special
  handling for UTF-16/32 and UCS-2/4
- src/mbyte.c: Fix convert_setup_ext() to use == ENC_UNICODE instead of
  & ENC_UNICODE. The bitwise AND was incorrectly treating UTF-16/UCS-2
  (which have ENC_UNICODE + ENC_2BYTE etc.) as UTF-8, causing iconv
  setup to be skipped.
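
The difference between the two checks can be shown in isolation. The
DEMO_ flag values and function names below are assumptions made for
this sketch. They mirror the idea of Vim's ENC_ property bits but are
not copied from its headers.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative property bits (assumed values, not Vim's definitions). */
enum {
    DEMO_ENC_UNICODE  = 0x04,	/* some Unicode encoding */
    DEMO_ENC_ENDIAN_L = 0x20,	/* little endian */
    DEMO_ENC_2BYTE    = 0x40	/* 16-bit code units */
};

/* Old-style check: true for ANY encoding with the Unicode bit set. */
bool demo_is_utf8_bitwise(int props)
{
    assert(props != 0);		/* 0 (unknown) is not handled here */
    return (props & DEMO_ENC_UNICODE) != 0;
}

/* Fixed check: true only when no other bits are set, i.e. UTF-8. */
bool demo_is_utf8_exact(int props)
{
    assert(props != 0);		/* 0 (unknown) is not handled here */
    return props == DEMO_ENC_UNICODE;
}
```

With props for utf-16le built as DEMO_ENC_UNICODE | DEMO_ENC_2BYTE |
DEMO_ENC_ENDIAN_L, the bitwise check returns true (it would be taken
for UTF-8) while the exact comparison returns false, so the conversion
setup is no longer skipped.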

fixes:  #19198
closes: #19246

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Signed-off-by: Yasuhiro Matsumoto <mattn.jp@gmail.com>
Signed-off-by: Christian Brabandt <cb@256bit.org>

2026-01-31 15:59:01 +00:00