
Lazy Parse A Stateful, Multiline Per Record Data Stream In Python?

Here's how one file looks: BEGIN_META stuff to discard END_META BEGIN_DB header to discard data I wish to extract END_DB I'd like to be able

Solution 1:

Something like this might work:

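import itertools

def chunks(it):
    it = iter(it)
    while True:
        # Skip everything up to the next BEGIN_DB line.
        body = itertools.dropwhile(lambda x: 'BEGIN_DB' not in x, it)
        # Skip the header, which ends at the first blank line.
        body = itertools.dropwhile(lambda x: x.strip(), body)
        # Consume the blank line itself; stop cleanly once the input runs out
        # (a bare next() here raises RuntimeError on Python 3.7+).
        if next(body, None) is None:
            return
        yield itertools.takewhile(lambda x: 'END_DB' not in x, body)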

For example:

src = """
BEGIN_META
    stuff
    to
    discard
END_META
BEGIN_DB
    header
    to
    discard

    1data I
    1wish to
    1extract
 END_DB


BEGIN_META
    stuff
    to
    discard
END_META
BEGIN_DB
    header
    to
    discard

    2data I
    2wish to
    2extract
 END_DB
"""


src = iter(src.splitlines())
for chunk in chunks(src):
    for line in chunk:
        print(line.strip())
    print()
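One thing worth noting about the approach above: each chunk appears to be a lazy `takewhile` over the shared iterator, so it should be consumed before the next chunk is requested. A minimal sketch of one way to handle that, assuming you are happy to hold one record in memory at a time. The `records` helper and the shortened `sample` are hypothetical names introduced here for illustration, and `chunks` is repeated (with a guard for the end of input) only so the snippet runs on its own:

```python
import itertools

def chunks(it):
    # Same idea as the answer above, repeated so this sketch is self-contained.
    it = iter(it)
    while True:
        body = itertools.dropwhile(lambda x: 'BEGIN_DB' not in x, it)
        body = itertools.dropwhile(lambda x: x.strip(), body)
        if next(body, None) is None:  # no blank line left: input exhausted
            return
        yield itertools.takewhile(lambda x: 'END_DB' not in x, body)

def records(lines):
    # Hypothetical helper: materialize each lazy chunk as a list of
    # stripped lines before moving on to the next record.
    for chunk in chunks(lines):
        yield [line.strip() for line in chunk]

# Shortened sample in the same layout as the example above.
sample = """BEGIN_META
    stuff
END_META
BEGIN_DB
    header

    1data I
    1wish to
END_DB
BEGIN_DB
    header

    2data I
END_DB
"""

result = list(records(sample.splitlines()))
print(result)
```

The stream is still read lazily, record by record; only the lines of the current record are buffered.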

Solution 2:

You can split the logic into separate functions, which makes the program easier to follow and the code more modular and flexible. Try to stay away from something like

state = "some string"

If you later want to extend this module, you need to know which values the variable "state" can take and what happens when it changes. You're not guaranteed to remember this, and that can set you up for some hassles. Writing functions that mimic this behavior is cleaner and easier to implement.

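import sys

def read_stdin():
    # Yield lines lazily from standard input.
    for line in sys.stdin:
        yield line

def search_line_for_start_db(line):
    # A BEGIN_DB marker means a record follows on the stream.
    if "BEGIN_DB" in line:
        search_db_for_info()

def search_db_for_info():
    # Skip the header, which ends at the first blank line.
    for new_line in read_line:
        if not new_line.strip():
            break
    # Collect the data lines until END_DB is reached.
    for new_line in read_line:
        if "END_DB" in new_line:
            break
        # Put your information somewhere
        raw_tables.append(new_line.strip())

read_line = read_stdin()
raw_tables = []
while True:
    try:
        search_line_for_start_db(next(read_line))
    except StopIteration:  # Your stdin stream has finished being read
        break  # end your program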
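The function-per-step idea above could also be sketched as a single generator that takes any iterable of lines, which avoids the module-level `read_line` and `raw_tables`. This is only a sketch under assumptions: the name `iter_db_records` is hypothetical, and the header is assumed to end at the first blank line, as in the sample data from Solution 1.

```python
def iter_db_records(lines):
    """Yield one list of data lines per BEGIN_DB ... END_DB block."""
    lines = iter(lines)
    for line in lines:
        if "BEGIN_DB" not in line:
            continue  # still outside a record, e.g. inside BEGIN_META
        # Skip the header, assumed to end at the first blank line.
        for line in lines:
            if not line.strip():
                break
        # Collect data lines up to (but not including) END_DB.
        # If the input ends early, the partial record is still yielded.
        record = []
        for line in lines:
            if "END_DB" in line:
                break
            record.append(line.strip())
        yield record

# Small in-memory stand-in for the real stream.
sample_lines = [
    "BEGIN_META", "    stuff", "END_META",
    "BEGIN_DB", "    header", "", "    1data I", "    1wish to", "END_DB",
    "BEGIN_DB", "    header", "", "    2data I", "END_DB",
]

parsed = list(iter_db_records(sample_lines))
print(parsed)
```

With real input you could pass `sys.stdin` or an open file object instead of `sample_lines`. Since the generator pulls lines only as records are requested, the stream is never read ahead of the record being built.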
