Lazy Parse A Stateful, Multiline Per Record Data Stream In Python?
Here's how one file looks:
BEGIN_META
stuff
to
discard
END_META
BEGIN_DB
header
to
discard

data I
wish to
extract
END_DB
I'd like to be able
Solution 1:
Something like this might work:
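import itertools

def chunks(it):
    while True:
        # Skip ahead to the next BEGIN_DB line.
        it = itertools.dropwhile(lambda x: 'BEGIN_DB' not in x, it)
        # Skip BEGIN_DB and the header, up to the blank separator line.
        it = itertools.dropwhile(lambda x: x.strip(), it)
        # Consume the blank separator; stop cleanly once the input runs out.
        if next(it, None) is None:
            return
        # Lazily yield the data lines; END_DB itself is consumed and dropped.
        yield itertools.takewhile(lambda x: 'END_DB' not in x, it)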
For example:
src = """
BEGIN_META
stuff
to
discard
END_META
BEGIN_DB
header
to
discard

1data I
1wish to
1extract
END_DB
BEGIN_META
stuff
to
discard
END_META
BEGIN_DB
header
to
discard

2data I
2wish to
2extract
END_DB
"""
src = iter(src.splitlines())
for chunk in chunks(src):
    for line in chunk:
        print(line.strip())
    print()
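Each chunk is a lazy view over the same underlying iterator, so it has to be consumed before the next chunk is requested. The following is a minimal sketch, not part of the original answer, that materializes each chunk into a list so records can be stored or inspected later. It assumes a blank line separates the header from the data, and the helper name `records` and the `sample` list are made up for illustration.

```python
import itertools

def chunks(it):
    # Python 3 rendering of the generator above, repeated so this sketch runs on its own.
    while True:
        it = itertools.dropwhile(lambda x: 'BEGIN_DB' not in x, it)
        it = itertools.dropwhile(lambda x: x.strip(), it)
        if next(it, None) is None:
            return
        yield itertools.takewhile(lambda x: 'END_DB' not in x, it)

def records(lines):
    # Hypothetical helper: turn each lazy chunk into a list of stripped lines.
    # The list is built before the next chunk is requested, which keeps the
    # shared iterator in a consistent position.
    for chunk in chunks(iter(lines)):
        yield [line.strip() for line in chunk]

# Illustrative input: two DB blocks, each with a header and a blank separator.
sample = [
    "BEGIN_META", "stuff", "END_META",
    "BEGIN_DB", "header", "", "1data I", "1wish to", "END_DB",
    "BEGIN_DB", "header", "", "2data I", "END_DB",
]
result = list(records(sample))
print(result)
```

Only one record is held in memory at a time as long as the caller iterates over `records(...)` instead of wrapping it in `list`.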
Solution 2:
Splitting the parser into small functions makes the logic easier to follow and the code more modular and flexible. Try to avoid tracking where you are in the file with a string variable such as
state = "some string"
If you later want to extend the module, you have to remember every value that "state" can take and what happens each time it changes. You're not guaranteed to remember that, and it can set you up for some hassles. Writing one function per phase of the parse is cleaner and easier to extend.
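import sys

def read_stdin():
    # Lazily yield lines from standard input, one at a time.
    for line in sys.stdin:
        yield line

def search_line_for_start_db(line):
    # Hand off to the block reader when a DB block begins.
    if "BEGIN_DB" in line:
        search_db_for_info()

def search_db_for_info():
    # Discard the header; this assumes it ends at the first blank line.
    for new_line in read_line:
        if not new_line.strip():
            break
    # Collect the data lines until END_DB closes the block.
    for new_line in read_line:
        if "END_DB" in new_line:
            break
        # Put your information somewhere
        raw_tables.append(new_line)

read_line = read_stdin()
raw_tables = []
# The loop ends on its own once the stdin stream has been fully read.
for line in read_line:
    search_line_for_start_db(line)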
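The same function-per-phase idea can stay lazy if the line iterator is passed in explicitly instead of being shared through a global list. This is a rough sketch under assumptions, not the answerer's code: the names `skip_header`, `read_db` and `iter_dbs` are hypothetical, and the header is assumed to end at the first blank line.

```python
import io

def skip_header(lines):
    # Assumption: the header ends at the first blank line.
    for line in lines:
        if not line.strip():
            return

def read_db(lines):
    # Yield data lines lazily until END_DB closes the block.
    for line in lines:
        if "END_DB" in line:
            return
        yield line.rstrip("\n")

def iter_dbs(stream):
    # One shared iterator, so each helper resumes where the previous one stopped.
    lines = iter(stream)
    for line in lines:
        if "BEGIN_DB" in line:
            skip_header(lines)
            yield read_db(lines)

# Illustrative input; any iterable of lines, such as sys.stdin, could be used.
stream = io.StringIO(
    "BEGIN_META\nstuff\nEND_META\n"
    "BEGIN_DB\nheader\n\ndata I\nwish to\nextract\nEND_DB\n"
)
tables = [list(db) for db in iter_dbs(stream)]
print(tables)
```

Each block yielded by `iter_dbs` should be consumed before moving on to the next one, because all of them read from the same underlying stream.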