2. Strings & Text
Almost every useful program involves some kind of text processing,
whether it is parsing data or generating output.
Here we discuss challenges involving text manipulation: searching, substitution, lexing, parsing, etc.
Many of these problems can be solved easily with built-in string methods,
while more complicated operations might require regular expressions.
2.1: Splitting a string on any of multiple delimiters
Problem
You need to split a string into fields, but the delimiters (and the space around them) aren't consistent throughout the string.
Solution
The str.split() method of string objects is really meant for very simple use cases
and doesn't allow for multiple delimiters.
When you need a bit more flexibility, use re.split() instead.
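For example, re.split(r"[-;,\s]\s*", line) splits line wherever it finds
a semicolon (;), a comma (,), a hyphen (-), or a whitespace character,
followed by any amount of additional whitespace.
The hyphen is listed first in the character class so that it is read as a literal hyphen;
in the form [;,-\s], Python rejects ,-\s as a bad character range.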
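A minimal sketch of this solution follows. The sample `line` value is a made-up input used only for illustration, not data from the original text.

```python
import re

# Hypothetical sample input with inconsistent delimiters and spacing.
line = "asdf fjdk; afed, fjek-asdf,   foo"

# One delimiter character (";", ",", "-", or whitespace),
# followed by zero or more additional whitespace characters.
fields = re.split(r"[-;,\s]\s*", line)

print(fields)
# ['asdf', 'fjdk', 'afed', 'fjek', 'asdf', 'foo']
```

Because the trailing `\s*` is part of the delimiter, the extra spaces after a comma or semicolon do not produce empty fields.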
Discussion
re.split() is useful because you can specify multiple patterns for the delimiter.
In the solution above, the delimiter is a semicolon (;), a comma (,), a hyphen (-),
or a whitespace character, followed by any amount of additional whitespace.
Whenever that pattern is found, the entire match becomes the delimiter
between whatever fields lie on either side of it.
The result is a list of fields, just like the result of str.split().
When using re.split(), you need to be a bit careful if the regular expression
involves a capture group enclosed in parentheses. If capture groups are used, the
text matched by each group (i.e. the delimiter/separator) is also included in the result.
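The effect of a capture group can be sketched as follows. The sample `line` is again a hypothetical input; only the delimiter character is wrapped in the group, so the trailing whitespace is still discarded.

```python
import re

# Hypothetical sample input with inconsistent delimiters and spacing.
line = "asdf fjdk; afed, fjek-asdf,   foo"

# The parentheses capture the delimiter character itself,
# so re.split() places it in the result between the fields.
fields = re.split(r"([-;,\s])\s*", line)

print(fields)
# ['asdf', ' ', 'fjdk', ';', 'afed', ',', 'fjek', '-', 'asdf', ',', 'foo']
```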
Getting the split characters (delimiters/separators) might be useful in certain contexts. For example, you may want to use the split characters later on to rebuild an output string.
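One possible way to rebuild a string from the captured delimiters is sketched below. The names `values`, `delimiters`, and `rebuilt` are illustrative choices, and the sample `line` is a hypothetical input.

```python
import re

# Hypothetical sample input with inconsistent delimiters and spacing.
line = "asdf fjdk; afed, fjek-asdf,   foo"

# Capture the delimiter character so it is kept in the result.
fields = re.split(r"([-;,\s])\s*", line)

# Fields sit at even indexes, captured delimiters at odd indexes.
values = fields[::2]
# Pad with an empty string so both lists have the same length.
delimiters = fields[1::2] + [""]

# Re-join each field with the delimiter that originally followed it.
rebuilt = "".join(v + d for v, d in zip(values, delimiters))

print(rebuilt)
# asdf fjdk;afed,fjek-asdf,foo
```

The rebuilt string keeps each original delimiter character but drops the extra whitespace that followed it, since that whitespace was outside the capture group.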
If you don't want the separators/delimiters in the result list
but still need parentheses () to group parts of the regular expression pattern,
use a non-capturing group, written (?:...).
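A short sketch of the non-capturing form follows, using the same hypothetical sample `line`. The alternation inside `(?:...)` is one way to write the grouping; a character class would work equally well here.

```python
import re

# Hypothetical sample input with inconsistent delimiters and spacing.
line = "asdf fjdk; afed, fjek-asdf,   foo"

# (?:...) groups the alternatives without capturing them,
# so the delimiters are left out of the result.
fields = re.split(r"(?:;|,|-|\s)\s*", line)

print(fields)
# ['asdf', 'fjdk', 'afed', 'fjek', 'asdf', 'foo']
```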