How can I get a code point pattern such as x-y\uxxxx\Uxxxxxxxxx for a given Unicode category, for example Connector Punctuation (Pc), for scanning ECMAScript 3/JavaScript identifiers?
Original question
I need help verifying whether a character (code point) is valid in an ECMA-262 (3rd edition, section 7.6) identifier, for a lexical scanner.
Syntax quote

Identifier ::
    IdentifierName but not ReservedWord

IdentifierName ::
    IdentifierStart
    IdentifierName IdentifierPart

IdentifierStart ::
    UnicodeLetter
    $
    _
    \ UnicodeEscapeSequence    (no need to check this)

IdentifierPart ::
    IdentifierStart
    UnicodeCombiningMark
    UnicodeDigit
    UnicodeConnectorPunctuation

UnicodeLetter ::
    any character in the Unicode categories “Uppercase letter (Lu)”, “Lowercase letter (Ll)”, “Titlecase letter (Lt)”, “Modifier letter (Lm)”, “Other letter (Lo)”, or “Letter number (Nl)”.

UnicodeCombiningMark ::
    any character in the Unicode categories “Non-spacing mark (Mn)” or “Combining spacing mark (Mc)”

UnicodeDigit ::
    any character in the Unicode category “Decimal number (Nd)”

UnicodeConnectorPunctuation ::
    any character in the Unicode category “Connector punctuation (Pc)”
As you can see, the grammar accepts any character from certain categories.
I need the full set of these characters, so my first step was to search for "Connector punctuation" in this Unicode 5.0 chart. I got 0 matches, so I believe I'm going about it the wrong way. Could someone help me?
Asked Mar 31, 2017 at 22:19 by user5066707; edited Nov 3, 2017 at 17:03.
2 Answers
Unicode offers this tool for determining sets of characters. It uses regular expressions with property-value pairs enclosed in [::].
For all characters in Unicode 5, use [:age=5.0:].
The rest are "general categories" (gc). So, for example, [:age=5.0:]&[:gc=Lu:] will find all uppercase letters in Unicode 5 (gc=L will find all letters in general).
For IdentifierStart you need [:age=5.0:]&[[:gc=L:][:gc=Nl:]\$_]. For IdentifierPart you need [:age=5.0:]&[[:gc=L:][:gc=Nl:][:gc=Mn:][:gc=Mc:][:gc=Nd:][:gc=Pc:]\$_].
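The same two sets can also be approximated directly in JavaScript. The following is a minimal sketch, assuming an engine with Unicode property escapes (ES2018 or later); the function names are made up for illustration. Note that `\p{...}` follows whatever Unicode version the engine ships, so it cannot reproduce the [:age=5.0:] restriction.

```javascript
// Sketch only: ES3 IdentifierStart / IdentifierPart as regex character classes.
// Assumption: the engine supports \p{...} with the `u` flag (ES2018+).
// The categories reflect the engine's Unicode version, not Unicode 5.0.
const ID_START_RE = /^[\p{L}\p{Nl}$_]$/u;
const ID_PART_RE = /^[\p{L}\p{Nl}\p{Mn}\p{Mc}\p{Nd}\p{Pc}$_]$/u;

// Hypothetical helpers: each takes a numeric code point.
function isIdentifierStart(codePoint) {
  return ID_START_RE.test(String.fromCodePoint(codePoint));
}

function isIdentifierPart(codePoint) {
  return ID_PART_RE.test(String.fromCodePoint(codePoint));
}

console.log(isIdentifierStart(0x41));   // 'A' is Lu
console.log(isIdentifierStart(0x30));   // '0' is Nd, allowed only as a part
console.log(isIdentifierPart(0x203F));  // UNDERTIE is Pc
```

A scanner would call isIdentifierStart on the first code point of a token and isIdentifierPart on each following one; escape sequences and reserved words still need separate handling.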
Unicode also has properties called ID_Start and ID_Continue, but they do not cover exactly the same characters as your specification.
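A few concrete characters make the difference visible. This is a rough sketch, assuming an engine that supports the `ID_Start` and `ID_Continue` binary properties in regular expressions (ES2018 or later); the `compare` helper is hypothetical and the sample characters were picked only for illustration.

```javascript
// Sketch only: compare the ES3 category sets with ID_Start / ID_Continue.
// Assumption: the engine supports \p{ID_Start} and \p{ID_Continue} (ES2018+).
const es3Start = /^[\p{L}\p{Nl}$_]$/u;
const es3Part = /^[\p{L}\p{Nl}\p{Mn}\p{Mc}\p{Nd}\p{Pc}$_]$/u;
const idStart = /^\p{ID_Start}$/u;
const idContinue = /^\p{ID_Continue}$/u;

// Hypothetical helper: reports which of the four sets contain `ch`.
function compare(ch) {
  return {
    es3Start: es3Start.test(ch),
    idStart: idStart.test(ch),
    es3Part: es3Part.test(ch),
    idContinue: idContinue.test(ch),
  };
}

// '$' and '_' are added explicitly by the ES3 grammar.
// U+2118 (gc=Sm) is in ID_Start via Other_ID_Start.
// U+00B7 (gc=Po) is in ID_Continue via Other_ID_Continue.
for (const ch of ['$', '_', '\u2118', '\u00B7']) {
  const hex = ch.codePointAt(0).toString(16).toUpperCase().padStart(4, '0');
  console.log('U+' + hex, compare(ch));
}
```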
Here is also an overview of all Unicode character properties.
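For a range pattern like the one asked about, one option is to generate it locally instead of copying it from the online tool. The sketch below assumes an engine with Unicode property escapes and reads the notation in the question as `\uXXXX` for BMP code points and `\UXXXXXXXX` (eight hex digits) for the rest; that reading, and all function names, are assumptions. The result reflects the engine's Unicode version rather than Unicode 5.0.

```javascript
// Sketch only: list every code point matching a regex as compact ranges.
// `re` must match exactly one code point, use the `u` flag, and must not
// use the `g` or `y` flag (test() would then carry state between calls).
function collectRanges(re) {
  const ranges = [];
  let start = -1;
  for (let cp = 0; cp <= 0x10FFFF; cp++) {
    const hit = re.test(String.fromCodePoint(cp));
    if (hit && start < 0) start = cp;
    if (!hit && start >= 0) {
      ranges.push([start, cp - 1]);
      start = -1;
    }
  }
  if (start >= 0) ranges.push([start, 0x10FFFF]);
  return ranges;
}

// Assumed notation: \uXXXX up to U+FFFF, \UXXXXXXXX above that.
function escapeCodePoint(cp) {
  const hex = cp.toString(16).toUpperCase();
  return cp <= 0xFFFF
    ? '\\u' + hex.padStart(4, '0')
    : '\\U' + hex.padStart(8, '0');
}

// Joins ranges as first-last pairs; single code points stand alone.
function toPattern(ranges) {
  return ranges
    .map(([first, last]) => first === last
      ? escapeCodePoint(first)
      : escapeCodePoint(first) + '-' + escapeCodePoint(last))
    .join('');
}

const pcPattern = toPattern(collectRanges(/^\p{Pc}$/u));
console.log(pcPattern);
```

Swapping `\p{Pc}` for another category, or for the full IdentifierPart class, gives the corresponding pattern.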
I'm the OP. I'm actually using another approach to determine the Unicode General Category. I made a tool that converts the UnicodeData.txt file into highly optimized binaries: https://github./matheusdiasdesouzads/unicode-general-category/tree/master/data and a library for working with General Categories: https://github./matheusdiasdesouzads/unicode-general-category/tree/master/language-specific/javascript-nodejs
let cat = GeneralCategory.from(0x41);
cat.toString(); // 'Lu'
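For readers who want to see the general idea without the library, here is a rough sketch of deriving general categories from UnicodeData.txt. It is not the tool above and does not use its binary format; the function names are hypothetical and the sample lines are only an illustrative excerpt in the UnicodeData.txt layout (semicolon-separated fields, with the code point in field 0, the name in field 1, and the general category in field 2).

```javascript
// Sketch only: parse UnicodeData.txt text into sorted [first, last, gc] ranges.
// Large blocks appear as a pair of lines named "<..., First>" and "<..., Last>".
function parseUnicodeData(text) {
  const ranges = [];
  let pendingFirst = null;
  for (const line of text.split(/\r?\n/)) {
    if (line.trim() === '') continue;
    const fields = line.split(';');
    const cp = parseInt(fields[0], 16);
    const name = fields[1];
    const gc = fields[2];
    if (name.endsWith(', First>')) {
      pendingFirst = cp;
      continue;
    }
    const first = name.endsWith(', Last>') && pendingFirst !== null ? pendingFirst : cp;
    pendingFirst = null;
    const prev = ranges[ranges.length - 1];
    if (prev && prev[2] === gc && prev[1] + 1 === first) {
      prev[1] = cp; // extend the previous range with the same category
    } else {
      ranges.push([first, cp, gc]);
    }
  }
  return ranges;
}

// Binary search over the ranges; code points not listed are reported as 'Cn'.
function generalCategoryOf(ranges, codePoint) {
  let lo = 0;
  let hi = ranges.length - 1;
  while (lo <= hi) {
    const mid = (lo + hi) >> 1;
    const [first, last, gc] = ranges[mid];
    if (codePoint < first) hi = mid - 1;
    else if (codePoint > last) lo = mid + 1;
    else return gc;
  }
  return 'Cn';
}

// Illustrative excerpt; a real run would read the whole file instead.
const sample = [
  '0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;',
  '0042;LATIN CAPITAL LETTER B;Lu;0;L;;;;;N;;;;0062;',
  '005F;LOW LINE;Pc;0;ON;;;;;N;SPACING UNDERSCORE;;;;',
  '4E00;<CJK Ideograph, First>;Lo;0;L;;;;;N;;;;;',
  '9FFF;<CJK Ideograph, Last>;Lo;0;L;;;;;N;;;;;',
].join('\n');

const sampleRanges = parseUnicodeData(sample);
console.log(generalCategoryOf(sampleRanges, 0x5F));
```

Using the Unicode 5.0 edition of UnicodeData.txt as input would give categories for that version specifically, which the regex property escapes cannot do.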