9-regular-expressions/03-regexp-character-classes/article.md
Lines changed: 11 additions & 7 deletions
@@ -43,7 +43,7 @@ Most used are:
 : A space symbol: that includes spaces, tabs, newlines.

 `\w` ("w" is from "word")
-: A "wordly" character: either a letter of English alphabet or a digit or an underscore. Non-english letters (like cyrillic or hindi) do not belong to `\w`.
+: A "wordly" character: either a letter of the English alphabet, a digit, or an underscore. Non-Latin letters (like Cyrillic or Hindi) do not belong to `\w`.

 For instance, `pattern:\d\s\w` means a "digit" followed by a "space character" followed by a "wordly character", like `"1 a"`.
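The pattern from the line above can be tried out with a minimal sketch. The sample strings and variable names here are hypothetical, chosen only for illustration:

```js
// Hypothetical sample string: a digit, then a space, then a "wordly" character.
const classMatch = "Item 1 a".match(/\d\s\w/);
console.log(classMatch[0]); // "1 a"

// \w accepts Latin letters, digits and the underscore,
// but not a non-Latin letter such as the Cyrillic "Я".
const wordly = "a_1Я".match(/\w/g);
console.log(wordly); // ["a", "_", "1"]
```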
 Once again let's note that `pattern:\b` makes the search engine test for the boundary, so that `pattern:Java\b` finds `match:Java` only when followed by a word boundary, but it does not add a letter to the result.

-Usually we use `\b` to find standalone English words. So that if we want `"Java"` language then `pattern:\bJava\b` finds exactly a standalone word and ignores it when it's a part of `"JavaScript"`.
+Usually we use `\b` to find standalone English words. So if we want the `"Java"` language, then `pattern:\bJava\b` finds exactly the standalone word and ignores it when it's part of another word, e.g. it won't match `match:Java` in `subject:JavaScript`.

 Another example: a regexp `pattern:\b\d\d\b` looks for standalone two-digit numbers. In other words, it requires that the characters before and after `pattern:\d\d` be different from `\w` (or be the beginning/end of the string).
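Both `\b` patterns discussed above can be checked with a short sketch. The input strings are made up for the example:

```js
// \bJava\b matches "Java" only as a standalone word.
const javaAlone = "Hello, Java!".match(/\bJava\b/);
console.log(javaAlone[0]); // "Java"

// Inside "JavaScript" there is no boundary after "Java", so nothing matches.
const javaInside = "Hello, JavaScript!".match(/\bJava\b/);
console.log(javaInside); // null

// Standalone two-digit numbers: "1" and "456" are skipped.
const twoDigit = "1 23 456 78".match(/\b\d\d\b/g);
console.log(twoDigit); // ["23", "78"]
```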
 ```warn header="Word boundary doesn't work for non-Latin alphabets"
 The word boundary check `\b` tests for a boundary between `\w` and something else. But `\w` means an English letter (or a digit or an underscore), so the test won't work for other characters (like Cyrillic or hieroglyphs).
+
+Later we'll come across Unicode character classes that allow us to solve a similar task for different languages.
 ```

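The limitation described in the warning can be seen in a small sketch. The last pattern is only one possible workaround and is an assumption, not something this diff prescribes: it uses the Unicode property `\p{L}` ("letter") with lookbehind and lookahead, which requires the `u` flag and an engine that supports lookbehind.

```js
// For Latin text \b behaves as expected.
const latinWord = /\bJava\b/.test("Java rocks");
console.log(latinWord); // true

// Cyrillic letters are not \w, so there is no \w/non-\w boundary anywhere.
const cyrillicWord = /\bПривет\b/.test("Привет, мир");
console.log(cyrillicWord); // false

// Assumed workaround: "not preceded and not followed by any letter".
const unicodeWord = /(?<!\p{L})Привет(?!\p{L})/u;
console.log(unicodeWord.test("Привет, мир")); // true
console.log(unicodeWord.test("Приветствие")); // false, part of a longer word
```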
@@ -223,13 +225,14 @@ alert( "CS4".match(/CS.4/) ); // null, no match because there's no character for

 Usually a dot doesn't match a newline character.

-For instance, this doesn't match:
+For instance, `pattern:A.B` matches `match:A`, and then `match:B` with any character between them, except a newline.
+
+This doesn't match:

 ```js run
 alert( "A\nB".match(/A.B/) ); // null (no match)

-// a space character would match
-// or a letter, but not \n
+// a space character would match, or a letter, but not \n
 ```

 Sometimes that's inconvenient: we really want "any character", newline included.
@@ -240,7 +243,6 @@ That's what `s` flag does. If a regexp has it, then the dot `"."` match literall
 alert( "A\nB".match(/A.B/s) ); // A\nB (match!)
 ```

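As a side-by-side sketch of the dot with and without the `s` flag (variable names are hypothetical; `dotAll` is the standard property that reflects the flag):

```js
const dotSubject = "A\nB";

// Without the s flag the dot refuses to match the newline.
const withoutS = dotSubject.match(/A.B/);
console.log(withoutS); // null

// With the s flag the dot matches any character, newline included.
const withS = dotSubject.match(/A.B/s);
console.log(withS[0] === dotSubject); // true

// A space between the letters matches even without the flag.
const withSpace = "A B".match(/A.B/);
console.log(withSpace[0]); // "A B"

// The flag is exposed on the regexp object.
console.log(/A.B/s.dotAll); // true
```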
-
 ## Summary

 There exist the following character classes:
@@ -255,7 +257,9 @@ There exist following character classes:

 ...But that's not all!

-Modern JavaScript also allows to look for characters by their Unicode properties, for instance:
+The Unicode encoding, used by JavaScript for strings, provides many properties for characters, like: which language a letter belongs to (if it is a letter), whether it is a punctuation sign, etc.
+
+Modern JavaScript allows us to use these properties in regexps to look for characters, for instance:

 - A Cyrillic letter is: `pattern:\p{Script=Cyrillic}` or `pattern:\p{sc=Cyrillic}`.
 - A dash (be it a small hyphen `-` or a long dash `—`): `pattern:\p{Dash_Punctuation}` or `pattern:\p{Pd}`.
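A minimal sketch of the two properties listed above. Note that `\p{...}` only works when the regexp has the `u` flag, and that property names and values are case-sensitive in JavaScript (the short alias for `Dash_Punctuation` is written `Pd`). The sample strings are hypothetical:

```js
// Script property: long and short forms are equivalent.
const hasCyrillic = /\p{Script=Cyrillic}/u.test("Привет");
console.log(hasCyrillic); // true
const latinOnly = /\p{sc=Cyrillic}/u.test("Hello");
console.log(latinOnly); // false

// Dash punctuation: matches both the hyphen and the long dash (U+2014).
const dashes = "a-b\u2014c".match(/\p{Dash_Punctuation}/gu);
console.log(dashes.length); // 2

// The short alias gives the same result.
const dashesShort = "a-b\u2014c".match(/\p{Pd}/gu);
console.log(dashesShort.length); // 2
```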