I was looking into GNU tr in bash on Debian Linux. The regex engine appears to have a [:lower:] and [:upper:] shorthand. The regex matches on "lowercase" and "uppercase" letters. The definition of these is not trivial: is Ñ an uppercase letter? (Examples here.)
It seems to map to an "islower" function which is defined in the C language, somehow.
https://en.cppreference/w/c/string/byte/islower
http://web.archive./web/20120308171350/https://cplusplus/reference/clibrary/cctype/islower/
Notice that what is considered a letter may depend on the locale being used; In the default C locale, a lowercase letter is any of: a b c d e f g h i j k l m n o p q r s t u v w x y z.
For a detailed chart on what the different ctype functions return for each character of the standard ASCII character set, see the reference for the <cctype> header.
https://github/coreutils/coreutils/blob/1f0bf8d7c4b7131c6a8762de02ea01affef4db65/src/tr.c#L392
I can't find where islower is defined, perhaps within a specific C implementation (e.g. gcc).
It also appears to depend on the "locale". Is that decided at compile time or at run time? https://docs.oracle/cd/E19253-01/817-2521/overview-1002/index.html
asked Nov 16, 2024 at 15:53 by Atomic Tripod
1 Answer
The set of lower case letters for each locale is commonly fixed before compile time.
localeconv() [formatting of numeric quantities] reports some locale attributes that can change at run time, but it says nothing about which characters are lower case.
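As a rough illustration of what localeconv() does expose, the sketch below reads the decimal-point string from the struct lconv it returns. The helper name uses_dot_decimal_point is made up for this example. struct lconv carries numeric and monetary formatting fields only, which is why it cannot answer the lower-case question.

```c
#include <assert.h>
#include <locale.h>
#include <string.h>

/* Hypothetical helper (not part of the C library): reports whether the
 * current locale formats numbers with "." as the decimal point.
 * The struct returned by localeconv() belongs to the C library and must
 * not be modified by the caller. */
int uses_dot_decimal_point(void)
{
    const struct lconv *lc = localeconv();
    assert(lc != NULL && lc->decimal_point != NULL);
    return strcmp(lc->decimal_point, ".") == 0;
}
```

In the "C" locale this should report true; after a successful setlocale(LC_NUMERIC, ...) to a locale that writes decimals with a comma, it would be expected to report false.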
The locale may change with char *setlocale(int category, const char *locale);
At program startup, the equivalent of setlocale(LC_ALL, "C"); is executed.
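One way to observe that startup state is to query the locale instead of setting it: passing NULL as the second argument makes setlocale() return the current name without changing anything. The helper name ctype_locale_is_c below is an assumption for illustration only.

```c
#include <assert.h>
#include <locale.h>
#include <string.h>

/* Hypothetical helper: reports whether the LC_CTYPE category is currently
 * the "C" locale. A NULL locale argument turns setlocale() into a query. */
int ctype_locale_is_c(void)
{
    const char *name = setlocale(LC_CTYPE, NULL);
    assert(name != NULL);
    return strcmp(name, "C") == 0;
}
```

Called before any other setlocale() call, this should report true, since the program starts in the "C" locale regardless of environment variables such as LANG.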
At least 2 locales are defined:
"C": A minimal C environment. This is defined in the spec with 'a' - 'z', and nothing else, as lower case letters.
"": Implementation's native environment.
Some implementations allow for dozens of different locales. Some only have the minimal 2 - which might use the same determination of lower case letters - so no functional difference.
Thus the behavior of islower() can change during a program's run.
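A minimal sketch of that run-time dependence: the function below counts how many byte values islower() accepts while a given locale is active, then puts the previous locale back. The name count_lower_in_locale is hypothetical, and which locale names exist depends on the system.

```c
#include <assert.h>
#include <ctype.h>
#include <locale.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: counts the byte values 0..255 that islower() accepts
 * while LC_CTYPE is set to `name`, then restores the previous locale.
 * Returns -1 if the locale cannot be selected or memory runs out. */
int count_lower_in_locale(const char *name)
{
    assert(name != NULL);

    /* Copy the current name first: the string returned by setlocale()
     * may be overwritten by the next setlocale() call. */
    const char *current = setlocale(LC_CTYPE, NULL);
    if (current == NULL)
        return -1;
    char *saved = malloc(strlen(current) + 1);
    if (saved == NULL)
        return -1;
    strcpy(saved, current);

    int count = -1;
    if (setlocale(LC_CTYPE, name) != NULL) {
        count = 0;
        for (int c = 0; c <= 255; c++) {
            /* c is within unsigned char range, as <ctype.h> requires */
            if (islower(c))
                count++;
        }
    }

    setlocale(LC_CTYPE, saved);   /* restore whatever was active before */
    free(saved);
    return count;
}
```

With "C" the count is 26, matching 'a' - 'z'. With an installed single-byte locale such as an ISO-8859-1 one, the count may be higher because bytes like the one encoding ñ can qualify. In UTF-8 locales the extra letters are multi-byte, so the byte-oriented islower() typically does not see them and the wide-character iswlower() is the relevant function.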
Soapbox: C's locale is an initial attempt to localize code to various country/culture standards. Yet it is cumbersome, inadequate and causes trouble with multi-threading. Proceed with caution.
Comments:
That is highly dependent on implementation. Traditionally it was common with an array, one element for each character in the full alphabet (so including control and non-printable characters, i.e. with 256 elements). Each element was a bit-mask, where a specific bit set meant that the character was an upper-case character or not. For common 8-bit encodings it might still be handled that way. – Some programmer dude Commented Nov 16, 2024 at 15:58
That's helpful, but tr has some definition of islower for any character I give it. How does it determine it? Is there an example implementation I could look at? – Atomic Tripod Commented Nov 16, 2024 at 15:59
Remember that all GNU tools are open source, which means that the source is available to read. It's part of GNU coreutils whose source is available from this GitHub repository. – Some programmer dude Commented Nov 16, 2024 at 16:05
And if it turns out that it's using the standard C isupper and islower, then the source for those is available as well. – Some programmer dude Commented Nov 16, 2024 at 16:12
There are files that contain the information for each locale installed on a system. See sourceware./glibc/wiki/Locales for an introduction. – Shawn Commented Nov 16, 2024 at 16:29
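The table-of-bit-masks approach described in the comments can be sketched roughly as follows. This is not the code of glibc or any other real C library: the names my_class_table, my_islower and my_isupper are invented for illustration, and the table is hard-coded to the "C" locale's letters, whereas a real library would typically fill such tables from per-locale data.

```c
#include <assert.h>
#include <limits.h>
#include <stdio.h>   /* EOF */

/* One bit per character class (hypothetical names). */
enum { MY_UPPER = 1 << 0, MY_LOWER = 1 << 1 };

/* One entry per unsigned char value; zero means "no class". */
static unsigned char my_class_table[UCHAR_MAX + 1];
static int my_table_ready = 0;

static void my_init_table(void)
{
    /* Listing the letters avoids assuming 'a'..'z' are contiguous codes. */
    const char *lower = "abcdefghijklmnopqrstuvwxyz";
    const char *upper = "ABCDEFGHIJKLMNOPQRSTUVWXYZ";
    for (const char *p = lower; *p != '\0'; p++)
        my_class_table[(unsigned char)*p] |= MY_LOWER;
    for (const char *p = upper; *p != '\0'; p++)
        my_class_table[(unsigned char)*p] |= MY_UPPER;
    my_table_ready = 1;
}

/* Same contract as <ctype.h>: c must be EOF or fit in unsigned char. */
static int my_has_class(int c, int mask)
{
    assert(c == EOF || (c >= 0 && c <= UCHAR_MAX));
    if (!my_table_ready)
        my_init_table();
    if (c == EOF)
        return 0;
    return (my_class_table[c] & mask) != 0;
}

int my_islower(int c) { return my_has_class(c, MY_LOWER); }
int my_isupper(int c) { return my_has_class(c, MY_UPPER); }
```

In a design like this, each test is a single array lookup and a bit test, and switching locale would amount to pointing the lookup at a different table. The lazy initialization here is not thread-safe; it only keeps the sketch short.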
Tags: regex · How does C determine whether a character is lower case (islower or isupper) · Stack Overflow