Regex Expression For A String
I want to split a string in Python. Sample string: "Hi this is ACT I. SCENE 1 and SCENE 2 and this is ACT II. SCENE 1 and SCENE 2 and more". I want to split it into the following list: ['Hi this i
Solution 1:
If I understand your requirements correctly, you may use the following pattern:
(?:ACT|SCENE).+?\d+|\S.*?(?=\s?(?:ACT|SCENE|$))
Breakdown:
(?:               # Start of a non-capturing group.
    ACT|SCENE     # Matches either 'ACT' or 'SCENE'.
)                 # Close the non-capturing group.
.+?               # Matches one or more characters (lazy matching).
\d+               # Matches one or more digits.
|                 # Alternation (OR).
\S                # Matches a non-whitespace character (to trim spaces).
.*?               # Matches zero or more characters (lazy matching).
(?=               # Start of a positive lookahead (i.e., followed by...).
    \s?           # An optional whitespace character (to trim spaces).
    (?:ACT|SCENE|$)  # Followed by either 'ACT' or 'SCENE' or the end of the string.
)                 # Close the lookahead.
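The breakdown above can also be kept inside the pattern itself by compiling it with re.VERBOSE, which ignores whitespace and allows # comments in the regex. This is only a sketch of the same pattern; the names PATTERN, sample and parts are made up for the example, and the sample is a shortened version of the string from the question.

```python
import re

# Same pattern as in the breakdown, written in verbose mode.
# PATTERN, sample and parts are illustrative names, not from the answer.
PATTERN = re.compile(
    r"""
    (?:ACT|SCENE)        # a heading keyword
    .+?                  # as few characters as possible...
    \d+                  # ...up to and including the first run of digits
    |                    # OR
    \S                   # start at a non-whitespace character
    .*?                  # as few characters as possible...
    (?=                  # ...until what follows is
        \s?              #   an optional whitespace character
        (?:ACT|SCENE|$)  #   then a keyword or the end of the string
    )
    """,
    re.VERBOSE,
)

sample = "Hi this is ACT I. SCENE 1 and SCENE 2"
parts = PATTERN.findall(sample)
print(parts)  # ['Hi this is', 'ACT I. SCENE 1', 'and', 'SCENE 2']
```

Verbose mode works here without changes because the pattern contains no literal spaces or # characters.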
Python example:
import re

regex = r"(?:ACT|SCENE).+?\d+|\S.*?(?=\s?(?:ACT|SCENE|$))"
test_str = "Hi this is ACT I. SCENE 1 and SCENE 2 and this is ACT II. SCENE 1 and SCENE 2 and more"
matches = re.findall(regex, test_str)  # named `matches` to avoid shadowing the built-in `list`
print(matches)
Output:
['Hi this is', 'ACT I. SCENE 1', 'and', 'SCENE 2', 'and this is', 'ACT II. SCENE 1', 'and', 'SCENE 2', 'and more']
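One thing to watch for: the pattern matches the letters ACT and SCENE anywhere, including inside longer words such as ACTION. If that matters for your input, a possible variation (an assumption on my part, not something the question asks for) is to add \b word boundaries around the keywords. The sample string below is made up to show the difference.

```python
import re

# Made-up input containing a word that merely starts with "ACT".
sample = "In ACTION now ACT I. SCENE 1 and more"

# Pattern from the answer, unchanged.
original = r"(?:ACT|SCENE).+?\d+|\S.*?(?=\s?(?:ACT|SCENE|$))"
# Hypothetical variation: keywords must be whole words.
bounded = r"\b(?:ACT|SCENE)\b.+?\d+|\S.*?(?=\s?(?:\b(?:ACT|SCENE)\b|$))"

original_parts = re.findall(original, sample)
bounded_parts = re.findall(bounded, sample)

print(original_parts)  # ['In', 'ACTION now ACT I. SCENE 1', 'and more']
print(bounded_parts)   # ['In ACTION now', 'ACT I. SCENE 1', 'and more']
```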
Solution 2:
Here is a working script, albeit a bit hackish:
import re

inp = "Hi this is ACT I. SCENE 1 and SCENE 2 and this is ACT II. SCENE 1 and SCENE 2 and more"
parts = re.findall(r'[A-Z]{2,}(?: [A-Z0-9.]+)*|(?![A-Z]{2})\w+(?: (?![A-Z]{2})\w+)*', inp)
print(parts)
This prints:
['Hi this is', 'ACT I. SCENE 1', 'and', 'SCENE 2', 'and this is', 'ACT II. SCENE 1',
'and', 'SCENE 2', 'and more']
An explanation of the regex logic, which uses an alternation to match one of two cases:
[A-Z]{2,}                   match TWO or more capital letters
(?: [A-Z0-9.]+)*            followed by zero or more words, consisting only of
                            capital letters, numbers, or period
|                           OR
(?![A-Z]{2})\w+             match a word which does NOT start with two capital letters
(?: (?![A-Z]{2})\w+)*       then match zero or more similar terms
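To see which side of the alternation produced each piece, the two cases can be wrapped in named groups and read back with lastgroup. This is just a sketch for illustration: the group names heading and text are my own, and the input is a shortened version of the sample string.

```python
import re

# The same two alternatives, each wrapped in a named group.
# "heading" and "text" are hypothetical names chosen for this example.
pattern = re.compile(
    r"(?P<heading>[A-Z]{2,}(?: [A-Z0-9.]+)*)"
    r"|(?P<text>(?![A-Z]{2})\w+(?: (?![A-Z]{2})\w+)*)"
)

inp = "Hi this is ACT I. SCENE 1 and SCENE 2"

# m.lastgroup is the name of the named group that matched.
labelled = [(m.lastgroup, m.group()) for m in pattern.finditer(inp)]
print(labelled)
# [('text', 'Hi this is'), ('heading', 'ACT I. SCENE 1'), ('text', 'and'), ('heading', 'SCENE 2')]
```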
Solution 3:
You can use re.findall:
import re
s = 'Hi this is ACT I. SCENE 1 and SCENE 2 and this is ACT II. SCENE 1 and SCENE 2 and more'
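# Raw string so that \d, \s and \. are passed to the regex engine untouched.
# The matches include surrounding spaces, so each one is stripped.
new_s = list(map(str.strip, re.findall(r'[A-Z\d\s\.]{2,}|^[A-Z]{1}[a-z\s]+|[a-z\s]+', s)))
print(new_s)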
Output:
['Hi this is', 'ACT I. SCENE 1', 'and', 'SCENE 2', 'and this is', 'ACT II. SCENE 1', 'and', 'SCENE 2', 'and more']
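As a quick sanity check, the three patterns can be run side by side on the sample string. The sketch below assumes nothing beyond the patterns already shown; the names patterns, expected and results are made up, and the strip() call only matters for the third pattern.

```python
import re

s = "Hi this is ACT I. SCENE 1 and SCENE 2 and this is ACT II. SCENE 1 and SCENE 2 and more"

# The three patterns from the answers above, keyed by illustrative labels.
patterns = {
    "solution 1": r"(?:ACT|SCENE).+?\d+|\S.*?(?=\s?(?:ACT|SCENE|$))",
    "solution 2": r"[A-Z]{2,}(?: [A-Z0-9.]+)*|(?![A-Z]{2})\w+(?: (?![A-Z]{2})\w+)*",
    "solution 3": r"[A-Z\d\s\.]{2,}|^[A-Z]{1}[a-z\s]+|[a-z\s]+",
}

expected = ['Hi this is', 'ACT I. SCENE 1', 'and', 'SCENE 2', 'and this is',
            'ACT II. SCENE 1', 'and', 'SCENE 2', 'and more']

# Strip every match so the third pattern's trailing spaces are removed.
results = {name: [m.strip() for m in re.findall(p, s)] for name, p in patterns.items()}

for name, parts in results.items():
    print(name, parts == expected)
```

All three agree on this input; they may diverge on other inputs, since each one relies on a different assumption about what a heading looks like.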