Python Split At Tag Regex
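I'm trying to split these lines:
<label>Olympic Games</label>
<title>Next stop</title>
Into:
['<label>', 'Olympic Games', '</label>']
['<title>', 'Next stop', '</title>']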
Solution 1:
Using lookarounds and a capture group to keep the text after splitting:
re.split(r'(?<=>)(.+?)(?=<)', '<label>Olympic Games</label>')
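As a quick check of what this call returns, here is a minimal sketch that runs it on the sample line. The names `line` and `parts` are only for illustration and are not from the original answer.

```python
import re

line = '<label>Olympic Games</label>'

# The lookbehind (?<=>) and lookahead (?=<) consume no characters, so the
# tags stay intact in the pieces on either side of the match. The capture
# group (.+?) makes re.split include the matched text in the result too.
parts = re.split(r'(?<=>)(.+?)(?=<)', line)
print(parts)  # ['<label>', 'Olympic Games', '</label>']
```

Without the capture group, re.split would drop the text between the tags and return only the two tags.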
Solution 2:
This regex works for me:
<(label|title)>([^<]*)</(label|title)>
or, as cwallenpoole suggested:
<(label|title)>([^<]*)</(\1)>
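I tested it with http://www.regexpal.com/
I have used three capturing groups. If you don't need the captures, remove the parentheses around [^<]* and around the closing tag name, and make the alternation non-capturing with (?:label|title). The \1 version needs its first group to stay capturing.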
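To see how the backreference version behaves in Python, here is a small sketch using re.findall. The variable names (`text`, `pattern`, `matches`, `mismatched`) and the mismatched sample string are assumptions made up for the illustration.

```python
import re

text = "<label>Olympic Games</label>\n<title>Next stop</title>"

# \1 must repeat whatever group 1 matched, so the closing tag has to
# carry the same name as the opening tag.
pattern = re.compile(r'<(label|title)>([^<]*)</\1>')

matches = pattern.findall(text)
print(matches)  # [('label', 'Olympic Games'), ('title', 'Next stop')]

# An opening <label> closed by </title> is rejected by the backreference.
mismatched = pattern.findall("<label>Olympic Games</title>")
print(mismatched)  # []
```

The first pattern in this answer, with (label|title) repeated in the closing tag, would accept the mismatched pair, which is the practical difference between the two.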
What is wrong with your regex <\*> is that it matches only one thing: the literal text <*>. You have escaped the * by writing \*, so what you are saying is: match a <, then a literal *, and then a >.
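The difference is easy to confirm in an interpreter. In the sketch below the sample string and the replacement pattern <[^>]*> are assumptions chosen for illustration, not something the asker wrote.

```python
import re

sample = "<label>Olympic Games</label> and a literal <*>"

# Escaped star: matches only the three characters "<*>".
literal_only = re.findall(r'<\*>', sample)
print(literal_only)  # ['<*>']

# [^>]* means "any run of characters other than >", so this matches
# every tag-like token, including the literal one.
tags = re.findall(r'<[^>]*>', sample)
print(tags)  # ['<label>', '</label>', '<*>']
```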
Solution 3:
Data:
line = """<label>Olympic Games</label>
<title>Next stop</title>"""
With look-ahead / look-behind assertions and re.findall:
import re
pattern = re.compile(r"(<.*(?<=>))(.*)((?=</)[^>]*>)")
print(re.findall(pattern, line))
# [('<label>', 'Olympic Games', '</label>'), ('<title>', 'Next stop', '</title>')]
Without look-ahead / look-behind assertions, just capturing groups, with re.findall:
pattern = re.compile(r"(<[^>]*>)(.*)(</[^>]*>)")
print(re.findall(pattern, line))
# [('<label>', 'Olympic Games', '</label>'), ('<title>', 'Next stop', '</title>')]
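Both patterns rely on each tag pair sitting on its own line, because `.` does not match a newline by default. If several pairs could share one line, the greedy (.*) would run to the last closing tag. The sketch below shows this on a made-up single-line input (`one_line` is an assumption, not part of the original data) and a lazy (.*?) variant that may be a safer default.

```python
import re

# Hypothetical input: both tag pairs on a single line.
one_line = "<label>Olympic Games</label><title>Next stop</title>"

# Greedy (.*) extends as far as it can, so it swallows the inner tags.
greedy = re.findall(r"(<[^>]*>)(.*)(</[^>]*>)", one_line)
print(greedy)
# [('<label>', 'Olympic Games</label><title>Next stop', '</title>')]

# Lazy (.*?) stops at the first closing tag it can reach.
lazy = re.findall(r"(<[^>]*>)(.*?)(</[^>]*>)", one_line)
print(lazy)
# [('<label>', 'Olympic Games', '</label>'), ('<title>', 'Next stop', '</title>')]
```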
Solution 4:
If you don't mind punctuation, here is a quick non-regex alternative using itertools.groupby.
Code
import itertools as it
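def split_at(iterable, pred, keep_delimiter=False):
    """Return an iterable split by a delimiter."""
    if keep_delimiter:
        # Keep every group, including the runs of delimiter items.
        return [list(g) for k, g in it.groupby(iterable, pred)]
    # Drop the groups whose items satisfy pred (the delimiters).
    return [list(g) for k, g in it.groupby(iterable, pred) if not k]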
Demo
>>> words = "Lorem ipsum ..., consectetur ... elit, sed do eiusmod ...".split(" ")
>>> pred = lambda x: "elit" in x
>>> split_at(words, pred, True)
[['Lorem', 'ipsum', '...,', 'consectetur', '...'],
['elit,'],
['sed', 'do', 'eiusmod', '...']]
>>> words = "Lorem ipsum ..., consectetur ... elit, sed do eiusmod ...".split(" ")
>>> pred = lambda x: "consect" in x
>>> split_at(words, pred, True)
[['Lorem', 'ipsum', '...,'],
['consectetur'],
['...', 'elit,', 'sed', 'do', 'eiusmod', '...']]
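Both demos keep the delimiter. For the case where the delimiter should be discarded, the same groupby idea can be written inline, as in this minimal sketch. It filters on `not k`, which is an assumption about the intended behavior, and `pieces` is a name introduced only for the example.

```python
import itertools as it

words = "Lorem ipsum ..., consectetur ... elit, sed do eiusmod ...".split(" ")
pred = lambda x: "elit" in x

# groupby yields (key, group) pairs, where key is pred's result for that
# run of items. Keeping only the runs whose key is False drops the
# delimiter items and leaves the pieces on either side.
pieces = [list(g) for k, g in it.groupby(words, pred) if not k]
print(pieces)
# [['Lorem', 'ipsum', '...,', 'consectetur', '...'], ['sed', 'do', 'eiusmod', '...']]
```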