javascript - Convert UTF-8 String with only 8 Bits per Character


I have a JavaScript string that contains characters that have a charCode greater than 255.

I want to be able to encode/decode that string into another string that has all its charCode less than or equal to 255.

There is no restriction on the characters in the output (e.g. they can be non-printable).

I want a solution that is as fast as possible and that produces a string as small as possible.

It must also work for any UTF-8 character.

I found out that encodeURI does exactly that, but it seems that it takes a lot of space.

encodeURI('ĉ') === "%C4%89" // 6 bytes...

Is there anything better than encodeURI?

Asked Jun 16, 2016 at 19:19 by RainingChain. Edited Jul 10, 2018 at 0:09 by Grant Miller.

Do you have any other requirements on the encoding, other than that there is no charCode greater than 255? Is it allowed to have quotation marks, spaces, non-printable characters, NUL characters? – Paul Commented Jun 16, 2016 at 19:22
No other requirements. The data is sent as binary. – RainingChain Commented Jun 16, 2016 at 19:24
Fast and as small as possible are somewhat mutually exclusive. You could try LZW compression of the string. Just how large is the string you want to compress, and why do you need to compress it? E.g. if it is for a GET request, perhaps you could use a POST request instead, which would transmit the bytes quite effectively. – Andrew Morton Commented Jun 16, 2016 at 19:32
You could convert each character's char code to base 255 and then delimit them with the one unused character. – Paul Commented Jun 16, 2016 at 19:32
@AndrewMorton I'm using a compression library that encodes an object into a binary buffer. That library assumes each character of the strings within the object fits in 1 byte. – RainingChain Commented Jun 16, 2016 at 19:35


3 Answers

What you want to do is encode your string as UTF-8. Searching for how to do that in JavaScript, I found http://monsur.hossa.in/2012/07/20/utf-8-in-javascript.html, which gives:

function encode_utf8( s ) {
  return unescape( encodeURIComponent( s ) );
}

function decode_utf8( s ) {
  return decodeURIComponent( escape( s ) );
}

In short, this is almost exactly what you found already, plus unescaping each '%xx' code into a single character with that byte value.
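Since escape and unescape are deprecated, here is a rough sketch of the same conversion built on the standard TextEncoder and TextDecoder APIs instead. It assumes an environment that exposes them as globals (modern browsers, recent Node.js). The names toByteString and fromByteString are made up for this example and do not come from the linked article.

```javascript
// Hypothetical helpers; an alternative sketch, not the article's code.

// Encode a JavaScript string as a "byte string": one char (code 0..255)
// per UTF-8 byte.
function toByteString(s) {
  const bytes = new TextEncoder().encode(s); // Uint8Array of UTF-8 bytes
  const CHUNK = 0x8000; // keep each apply() call well below argument limits
  let out = "";
  for (let i = 0; i < bytes.length; i += CHUNK) {
    out += String.fromCharCode.apply(null, bytes.subarray(i, i + CHUNK));
  }
  return out;
}

// Decode a byte string produced by toByteString back into the original text.
function fromByteString(b) {
  const bytes = new Uint8Array(b.length);
  for (let i = 0; i < b.length; i++) {
    bytes[i] = b.charCodeAt(i); // assumes every char code is <= 255
  }
  // ignoreBOM: true keeps a leading U+FEFF instead of silently dropping it.
  return new TextDecoder("utf-8", { ignoreBOM: true }).decode(bytes);
}

console.log(toByteString("ĉ").length); // 2 chars instead of encodeURI's 6
```

One behavioural difference to be aware of: for a lone surrogate, encodeURIComponent throws a URIError, while TextEncoder substitutes U+FFFD.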

You can get the char code (the UTF-16 code unit) of a character with .charCodeAt(position), and use it to split one character into several characters that each fit in a byte.

First, loop through the string and get the char code of every character. For each one, start with an empty temporary string; while the char code is greater than 255, subtract 255 from it and append a ÿ (char code 255, the last character of Latin-1) to the temporary string. Once the value is 255 or less, convert it with String.fromCharCode(charCode) and append it as well. The temporary string then stands in for the original character.

function encode(string) {
    var result = [];
    for (var i = 0; i < string.length; i++) {
        var charCode = string.charCodeAt(i);
        var temp = "";
        while (charCode > 255) {
            temp += "ÿ";
            charCode -= 255;
        }
        result.push(temp + String.fromCharCode(charCode));
    }
    return result.join(",");
}

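The encoder above joins the groups with commas. Because a group can itself end in a comma, the decoder splits on the regex ,(?!,), which matches only the last comma in a run of commas. Note that this is not fully robust: an input such as "a,b" encodes to "a,,,b", which the split cannot separate correctly, so the scheme is only safe for input that contains no commas.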

function decode(string) {
    var characters = string.split(/,(?!,)/g);
    var result = "";
    for (var i = 0; i < characters.length; i++) {
        var charCode = 0;
        for (var j = 0; j < characters[i].length; j++) {
            charCode += characters[i].charCodeAt(j);
        }
        result += String.fromCharCode(charCode);
    }
    return result;
}
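The comma separator can be avoided altogether. The following is a sketch of a variant, not the code above: it reserves char code 255 purely as an "add 255 and keep reading" marker, so every group ends with a character below 255 and the groups delimit themselves. The names encodeUnary and decodeUnary are made up for this example.

```javascript
// Hypothetical variant of the scheme above, with no separator character.
function encodeUnary(string) {
  var parts = [];
  for (var i = 0; i < string.length; i++) {
    var charCode = string.charCodeAt(i);
    var temp = "";
    // Note ">=" rather than ">": the closing char of a group is always < 255.
    while (charCode >= 255) {
      temp += "\xFF";
      charCode -= 255;
    }
    parts.push(temp + String.fromCharCode(charCode));
  }
  return parts.join("");
}

function decodeUnary(string) {
  var result = "";
  var charCode = 0;
  for (var i = 0; i < string.length; i++) {
    var c = string.charCodeAt(i);
    charCode += c;
    if (c < 255) {
      // Any char below 255 closes the current group.
      result += String.fromCharCode(charCode);
      charCode = 0;
    }
  }
  return result;
}
```

The output still grows linearly with the char code (U+FFFF needs 258 characters), so for most text UTF-8 will be considerably smaller.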

UTF-8 is already an encoding for Unicode text that uses 8-bit code units (one to four bytes per code point). You can simply send the UTF-8 string over the wire.

Generally, JavaScript strings consist of UTF-16 code units.

For such strings, you can either encode each UTF-16 code unit as two 8-bit characters or use a variable-length encoding such as UTF-8.

If you have many characters at or above U+0800, which take three bytes each in UTF-8, the first might produce smaller results.

// See http://monsur.hossa.in/2012/07/20/utf-8-in-javascript.html
function encode_utf8(s) {
  return unescape(encodeURIComponent(s));
}

function decode_utf8(s) {
  return decodeURIComponent(escape(s));
}

function encode_fixed_length(s) {
  let length = s.length << 1,
      bytes = new Array(length);
  for (let i = 0; i < length; ++i) {
    let code = s.charCodeAt(i >> 1);
    bytes[i] = code >> 8;
    bytes[++i] = code & 0xFF;
  }
  return String.fromCharCode.apply(undefined, bytes);
}

function decode_fixed_length(s) {
  let length = s.length,
      chars = new Array(length >> 1);
  for (let i = 0; i < length; ++i) {
    chars[i >> 1] = (s.charCodeAt(i) << 8) + s.charCodeAt(++i);
  }
  return String.fromCharCode.apply(undefined, chars);
}

const string_1 = "\u0000\u000F\u00FF";
const string_2 = "\u00FF\u0FFF\uFFFF";

console.log(encode_fixed_length(string_1)); // "\x00\x00\x00\x0F\x00\xFF"
console.log(encode_fixed_length(string_2)); // "\x00\xFF\x0F\xFF\xFF\xFF"

console.log(encode_utf8(string_1));         // "\x00\x0F\xC3\xBF" 
console.log(encode_utf8(string_2));         // "\xC3\xBF\xE0\xBF\xBF\xEF\xBF\xBF"
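To decide between the two encodings for a given string, the output sizes can be computed without actually encoding anything. The helper below is a minimal sketch under the assumption that the input is well-formed UTF-16 (no lone surrogates); encodedSizes is a made-up name, not part of the code above.

```javascript
// Hypothetical helper: byte counts for both encodings of a string.
function encodedSizes(s) {
  let utf8 = 0;
  for (let i = 0; i < s.length; i++) {
    const code = s.charCodeAt(i);
    if (code < 0x80) {
      utf8 += 1;                 // ASCII
    } else if (code < 0x800) {
      utf8 += 2;                 // U+0080..U+07FF
    } else if (code >= 0xD800 && code <= 0xDBFF) {
      utf8 += 4;                 // high surrogate: the pair is one 4-byte sequence
      i++;                       // skip the low surrogate (assumes a valid pair)
    } else {
      utf8 += 3;                 // rest of the Basic Multilingual Plane
    }
  }
  return { utf8: utf8, fixedLength: s.length * 2 };
}

console.log(encodedSizes("\u0000\u000F\u00FF")); // { utf8: 4, fixedLength: 6 }
console.log(encodedSizes("\u00FF\u0FFF\uFFFF")); // { utf8: 8, fixedLength: 6 }
```

These counts match the lengths of the example outputs above: UTF-8 wins for mostly ASCII text, and the fixed-length form wins when most characters need three UTF-8 bytes.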

Performance comparison: See https://jsfiddle.net/r0d9pm25/1/

Results for 500000 iterations in Firefox 47:

6159.91ms encode_fixed_length()
7177.35ms encode_utf8()
