mirror of
https://github.com/VictoriaMetrics/VictoriaMetrics.git
synced 2024-12-16 00:41:24 +01:00
109 lines
5.3 KiB
Go
/*
Package uniseg implements Unicode Text Segmentation, Unicode Line Breaking, and
string width calculation for monospace fonts. Unicode Text Segmentation conforms
to Unicode Standard Annex #29 (https://unicode.org/reports/tr29/) and Unicode
Line Breaking conforms to Unicode Standard Annex #14
(https://unicode.org/reports/tr14/).

In short, using this package, you can split a string into grapheme clusters
(what people would usually refer to as a "character"), into words, and into
sentences. Or, in its simplest case, this package allows you to count the number
of characters in a string, especially when it contains complex characters such
as emojis, combining characters, or characters from Asian, Arabic, Hebrew, or
other languages. Additionally, you can use it to implement line breaking (or
"word wrapping"), that is, to determine where text can be broken over to the
next line when the width of the line is not big enough to fit the entire text.
Finally, you can use it to calculate the display width of a string for monospace
fonts.

# Getting Started

If you just want to count the number of characters in a string, you can use
[GraphemeClusterCount]. If you want to determine the display width of a string,
you can use [StringWidth]. If you want to iterate over a string, you can use
[Step], [StepString], or the [Graphemes] type (more convenient but less
performant). These provide you with all information: grapheme clusters,
word boundaries, sentence boundaries, line breaks, and monospace character
widths. The specialized functions [FirstGraphemeCluster],
[FirstGraphemeClusterInString], [FirstWord], [FirstWordInString],
[FirstSentence], and [FirstSentenceInString] can be used if only one type of
information is needed.

# Grapheme Clusters

Consider the rainbow flag emoji: 🏳️🌈. On most modern systems, it appears as one
character. But its string representation actually has 14 bytes, so counting
bytes (or using len("🏳️🌈")) will not work as expected. Counting runes won't,
either: The flag has 4 Unicode code points, thus 4 runes. The stdlib function
utf8.RuneCountInString("🏳️🌈") and the expression len([]rune("🏳️🌈")) both
return 4.
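
The byte and rune counts above can be checked with the standard library alone.
The following is a minimal sketch; the names rainbowFlag and counts are
illustrative and not part of this package. The emoji is written with escapes so
that its four code points are explicit.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// rainbowFlag spells out the four code points of the rainbow flag emoji:
// U+1F3F3 (white flag), U+FE0F (Variation Selector-16), U+200D (ZWJ),
// and U+1F308 (rainbow).
const rainbowFlag = "\U0001F3F3\uFE0F\u200D\U0001F308"

// counts returns the number of bytes and the number of runes (code points)
// in s. Neither value is the number of user-perceived characters.
func counts(s string) (bytes, runes int) {
	return len(s), utf8.RuneCountInString(s)
}

func main() {
	b, r := counts(rainbowFlag)
	fmt.Println(b)                        // 14
	fmt.Println(r)                        // 4
	fmt.Println(len([]rune(rainbowFlag))) // 4
}
```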

The [GraphemeClusterCount] function will return 1 for the rainbow flag emoji.
The [Graphemes] type and a variety of functions in this package allow you to
split a string into its grapheme clusters.

# Word Boundaries

Word boundaries are used in a number of different contexts. The most familiar
ones are selection (double-click mouse selection), cursor movement ("move to
next word" control-arrow keys), and the dialog option "Whole Word Search" for
search and replace. This package provides methods for determining word
boundaries.

# Sentence Boundaries

Sentence boundaries are often used for triple-click or some other method of
selecting or iterating through blocks of text that are larger than single words.
They are also used to determine whether words occur within the same sentence in
database queries. This package provides methods for determining sentence
boundaries.

# Line Breaking

Line breaking, also known as word wrapping, is the process of breaking a section
of text into lines such that it will fit in the available width of a page,
window or other display area. This package provides methods to determine the
positions in a string where a line must be broken, may be broken, or must not be
broken.
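
To illustrate the three kinds of positions, here is a toy sketch that uses only
the standard library. It is not the UAX #14 algorithm and not this package's
API: it assumes a drastically simplified rule (a line feed forces a break, a
space allows one, everything else forbids one), and the names breakKind and
classifyBreaks are hypothetical.

```go
package main

import "fmt"

// breakKind is a hypothetical three-valued classification of the position
// directly after a rune.
type breakKind int

const (
	mustNotBreak breakKind = iota // the line must not be broken here
	mayBreak                      // the line may be broken here
	mustBreak                     // the line must be broken here
)

// classifyBreaks returns one breakKind per rune in s, describing the position
// directly after that rune. Simplified assumption: '\n' forces a break, ' '
// allows one, and any other rune forbids one.
func classifyBreaks(s string) []breakKind {
	kinds := make([]breakKind, 0, len(s))
	for _, r := range s {
		switch r {
		case '\n':
			kinds = append(kinds, mustBreak)
		case ' ':
			kinds = append(kinds, mayBreak)
		default:
			kinds = append(kinds, mustNotBreak)
		}
	}
	return kinds
}

func main() {
	for i, k := range classifyBreaks("a b\nc") {
		fmt.Println(i, k)
	}
}
```

The real rules depend on the line breaking class of each code point and its
neighbors, which is why a full implementation needs the Unicode data tables.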

# Monospace Width

Monospace width, as referred to in this package, is the width of a string in a
monospace font. This is commonly used in terminal user interfaces or text
displays or editors that don't support proportional fonts. A width of 1
corresponds to a single character cell. The C function [wcswidth()] and its
implementations in other programming languages are in widespread use for the
same purpose. However, there is no standard for the calculation of such widths,
and this package differs from wcswidth() in a number of ways, presumably to
generate more visually pleasing results.

To start, we assume that every code point has a width of 1, with the following
exceptions:

  - Code points with grapheme cluster break properties Control, CR, LF, Extend,
    and ZWJ have a width of 0.
  - U+2E3A, Two-Em Dash, has a width of 3.
  - U+2E3B, Three-Em Dash, has a width of 4.
  - Characters with the East-Asian Width properties "Fullwidth" (F) and "Wide"
    (W) have a width of 2. (Properties "Ambiguous" (A) and "Neutral" (N) both
    have a width of 1.)
  - Code points with grapheme cluster break property Regional Indicator have a
    width of 2.
  - Code points with grapheme cluster break property Extended Pictographic have
    a width of 2, unless their Emoji Presentation flag is "No", in which case
    the width is 1.
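
The shape of these per-code-point rules can be sketched with the standard
library. This is only an approximation and not this package's implementation:
the hypothetical function approxRuneWidth stands in for the real property
tables by assuming that unicode.IsControl and the Mn/Me categories roughly
cover Control, CR, LF, and Extend, and that two hard-coded ranges (CJK Unified
Ideographs and the fullwidth ASCII variants) represent the Wide and Fullwidth
properties. Regional Indicator and Extended Pictographic are left out.

```go
package main

import (
	"fmt"
	"unicode"
)

// approxRuneWidth returns an approximate monospace width for a single code
// point. It is a simplified, hypothetical stand-in for real Unicode property
// lookups and covers only a small part of the code space.
func approxRuneWidth(r rune) int {
	switch {
	case r == 0x2E3A: // Two-Em Dash
		return 3
	case r == 0x2E3B: // Three-Em Dash
		return 4
	case r == 0x200D, // ZWJ
		unicode.IsControl(r),      // approximates Control, CR, LF
		unicode.Is(unicode.Mn, r), // approximates Extend (nonspacing marks)
		unicode.Is(unicode.Me, r): // approximates Extend (enclosing marks)
		return 0
	case r >= 0x4E00 && r <= 0x9FFF, // CJK Unified Ideographs (Wide)
		r >= 0xFF01 && r <= 0xFF60: // fullwidth ASCII variants (Fullwidth)
		return 2
	default:
		return 1
	}
}

func main() {
	for _, r := range []rune{'a', 0x4E16, '\n', 0x0301, 0x2E3A, 0x2E3B} {
		fmt.Printf("U+%04X: %d\n", r, approxRuneWidth(r))
	}
}
```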

For Hangul grapheme clusters composed of conjoining Jamo and for Regional
Indicators (flags), all code points except the first one have a width of 0. For
grapheme clusters starting with an Extended Pictographic, any additional code
point will force a total width of 2, except if the Variation Selector-15
(U+FE0E) is included, in which case the total width is always 1. Grapheme
clusters ending with Variation Selector-16 (U+FE0F) have a width of 2.
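
The rule for clusters that start with an Extended Pictographic can be sketched
as follows. The function pictographicClusterWidth is hypothetical and not part
of this package. It assumes that the caller has already isolated one grapheme
cluster, knows that its first code point is an Extended Pictographic, and
passes in that code point's own width (1 or 2 according to the list above).

```go
package main

import "fmt"

const (
	vs15 = 0xFE0E // Variation Selector-15 (text presentation)
	vs16 = 0xFE0F // Variation Selector-16 (emoji presentation)
)

// pictographicClusterWidth returns the total width of a grapheme cluster that
// is assumed to start with an Extended Pictographic code point. firstWidth is
// the width of that first code point on its own.
func pictographicClusterWidth(cluster []rune, firstWidth int) int {
	if len(cluster) <= 1 {
		return firstWidth // no additional code points, keep the base width
	}
	for _, r := range cluster[1:] {
		if r == vs15 {
			return 1 // Variation Selector-15 always yields a total width of 1
		}
	}
	// Any other additional code point, including Variation Selector-16,
	// forces a total width of 2.
	return 2
}

func main() {
	// A base width of 1 is assumed for U+2764 in these calls.
	fmt.Println(pictographicClusterWidth([]rune{0x2764}, 1))       // 1
	fmt.Println(pictographicClusterWidth([]rune{0x2764, vs16}, 1)) // 2
	fmt.Println(pictographicClusterWidth([]rune{0x2764, vs15}, 1)) // 1
}
```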

Note that whether these widths appear correct depends on your application's
rendering engine, the extent to which it conforms to the Unicode Standard, and
its choice of font.

[wcswidth()]: https://man7.org/linux/man-pages/man3/wcswidth.3.html
*/
package uniseg