Regular Expressions
One of the unsung successes in standardization in computer science has been the
regular expression (RE), a language for specifying text search strings.
This practical language is used in every computer language, word processor, and
text processing tool, such as the Unix tool grep.
A regular expression is an algebraic notation for characterizing a set of strings.
Regular expressions are particularly useful for searching in texts, when we have a pattern to search for
and a corpus of texts to search through.
A regex search function will search through the corpus, returning all texts that match the pattern.
The corpus can be a single document or a collection.
For example, the Unix command-line tool grep takes a regex and
returns every line of the input document that matches the pattern.
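To make the grep behaviour concrete, here is a minimal sketch in Python that treats a multi-line string as the corpus and returns every line containing a match. The names grep_lines and corpus are hypothetical, chosen only for this illustration. Python's re module uses a syntax that is close to, but not identical to, the extended regex that grep -E accepts, so this is an approximation of grep and not a reimplementation of it.

```python
import re


def grep_lines(pattern, text):
    """Return every line of `text` that contains a match for `pattern`."""
    compiled = re.compile(pattern)
    # search() succeeds if the pattern matches anywhere in the line,
    # which mirrors how grep selects lines.
    return [line for line in text.splitlines() if compiled.search(line)]


# A tiny hypothetical corpus: one document with four lines.
corpus = """the quick brown fox
jumps over the lazy dog
Regular expressions are useful
grep prints matching lines"""

matches = grep_lines(r"the", corpus)
print(matches)
# ['the quick brown fox', 'jumps over the lazy dog']
```

Only the first two lines are returned, because they are the only ones that contain the string "the". A pattern with no match in the corpus yields an empty list.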
Regex comes in many variants. In this article we will be discussing
extended regex.
Basic Patterns
References:
- "Speech & Language Processing" by Jurafsky et al., 2021