2. Strings & Texts

Almost every useful program involve some kind of text processing whether it is data or generating output. Here we discuss about challenges involving text manipulation searching, substitution, lexing, parsing etc.

Many of these problems can be easily solved with built-in methods. While more complicated operations might require the use of regular expressions.

2.1: Splitting string on any of multiple delimeters

Problem

You need to split the string into fields but the delimiters (and the space around them) aren't consistent throughtout the string.

Solution

The str.split() method of string objects is really meant for very simple use cases and doesn't allow for multiple delimiters.

In case you need a bit more flexibility use re.split() instead as shown below.

import re

line = "vinay kumar, ram; lakshmi,           devi         anil-kumar-ram"

re.split(r"[;,-\s]\s*", line)   
## ["vinay", "kumar", "ram", "lakshmi", "devi", "anil", "kumar", "ram"]

Here in the above re.split(r"[;,-\s]\s*", line) the delimiter is either a semi-colon (;), a comma (,), a hyphen (-), a single space () or any of these followed by any number of spaces.

Discussion

The re.split() is useful because you can specify multiple patterns for the delimiter. In the above solution using re.split(), the separator/delimiter is either a semi-colon (;), a comma (,), a hyphen (-), a single space () or any of these followed by any number of spaces. Whenever that pattern is found, the entire match becomes a delimiter between whatever fileds lie on either side of the match.

The result is list of fileds just like str.split()

When using re.split(), you need to be a bit careful if the regular expression involves a capture group enclosed in parenthesis. If capture groups are used, then the matched text (i.e. the delimiter/separator) is also included in the result as shown below.

import re

line = "vinay kumar, ram; lakshmi,           devi         anil-kumar-ram"

fields = re.split(r"(;|,|-|\s)\s*", line)

print(fields)
## ["vinay", " ", "kumar", ",", "ram", ";", "laksmi", ",", "devi", " ", "anil", "-", "kumar", "-", "ram"]

Getting the split characters / delimiters / separators might be useful in certain contexts. For example, you may want ot use the split characters later on to reform the output string.

values = fields[::2]                ## ["vinay", "kumar", "ram", "lakshmi", "devi", "anil", "kumar", "ram"]
delimiters = fields[1::2] + [""]    ## [" ", ",", ";", "-", " ", "-", "-", ""]

## reform the line using the same delimiters
"".join(v+d for v, d in zip(values, delimiters))    ## "vinay kumar,ram;lakshmi,devi anil-kumar-ram"

If you don't want the separator/delimiter to be in the result list but still need to use the parenthesis () to group parts of the regular expression pattern; then use a non-capture group specified by (?:...) as shown below.

re.split(r"(?:;|,|-|\s)\s*", line)  ## ["vinay", "kumar", "ram", "lakshmi", "devi", "anil", "kumar", "ram"]