In the past few articles in the Python series, we’ve learned a lot about working with regular expressions in Python.
In this article, we’ll show how to use Python regular expressions for a realistic task.
We’ll walk step by step through building Python data structures from formatted flat text files.
If you are new to Python regular expressions, the following two articles will help:
- Getting started with python reg-ex using re.match search findall
- Advanced python reg-ex examples – Multi-line, substitution, greedy/non-greedy matching
Our example starts with some formatted flat text. This data could have come from a text file containing profile information for a dating site:
>>> rawProfiles = '''
... Tim Fake, 1982/03/21, I like to
... eat, sleep and
... relax
...
... Lisa Test, 1990/05/12, I like long
... walks of the beach, watching sun-sets,
... and listening to slow jazz
... '''
The format of this text is:
<name>, <birth-date>, <description>
However, the description can span multiple lines, and each profile is separated by at least one blank line. We can use the split() function from the ‘re’ module to process this raw text. First we will separate each profile:
>>> import re
>>> profilesList = re.split(r'\n{2,}', rawProfiles)
>>> profilesList
['\nTim Fake, 1982/03/21, I like to\neat, sleep and\nrelax', ' Lisa Test, 1990/05/12, I like long\nwalks of the beach, watching sun-sets,\nand listening to slow jazz \n']
The {} characters form a quantifier that specifies a range of repetitions to match. In our case, ‘\n{2,}’ matches a run of at least 2 newline characters; because we didn’t specify an upper limit, the run can be arbitrarily long. This corresponds to the format of the text: each profile is separated by at least one blank line (i.e. 2 consecutive newline characters).
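The quantifier can also be tried in isolation. The following is a minimal sketch on a made-up string (the sample text is an assumption, not part of the profile data) whose records are separated by differing numbers of newlines:

```python
import re

# Hypothetical sample: the separators are runs of 2 and 4 newline characters.
sample = "first record\nstill first\n\nsecond record\n\n\n\nthird record"

# '\n{2,}' matches any run of two or more newlines, so each run acts as
# a single separator; a lone newline inside a record is left untouched.
records = re.split(r'\n{2,}', sample)
print(records)
# ['first record\nstill first', 'second record', 'third record']
```

Because the quantifier is greedy, the run of four newlines is consumed as one separator rather than producing empty strings between records.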
Now we have a list of raw profiles. Before we do anything else, let’s take care of the stray newline characters dispersed throughout each profile. These are there because a profile can span multiple lines. For now we’ll just replace each of them with a space character using the sub() function:
>>> profilesList = [ re.sub(r'\n', ' ', profile) for profile in profilesList ]
>>> profilesList
[' Tim Fake, 1982/03/21, I like to eat, sleep and relax', ' Lisa Test, 1990/05/12, I like long walks of the beach, watching sun-sets, and listening to slow jazz ']
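If the raw text might also contain tabs or runs of spaces, one possible variation is to collapse every run of whitespace into a single space with the pattern \s+. This is an alternative to the newline-only substitution used in this article, not the approach it takes, and the sample string below is invented for illustration:

```python
import re

# Hypothetical profile containing a tab and a run of spaces as well as newlines.
profile = "Tim Fake, 1982/03/21, I like to\neat,\t sleep   and\nrelax"

# '\s+' matches one or more whitespace characters of any kind
# (spaces, tabs, newlines), and each run is replaced by one space.
flattened = re.sub(r'\s+', ' ', profile)
print(flattened)
# Tim Fake, 1982/03/21, I like to eat, sleep and relax
```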
The next step is to separate each profile into its individual fields. We could do this using matching and grouping (see the previous article on regex basics), but I’m going to do this using the split() function a second time. (For a more detailed look at Python list comprehensions, see my previous article on this topic.)
>>> profilesList = [ re.split(r',', profile, maxsplit=2) for profile in profilesList ]
>>> for profile in profilesList:
...     print(profile)
...
[' Tim Fake', ' 1982/03/21', ' I like to eat, sleep and relax']
[' Lisa Test', ' 1990/05/12', ' I like long walks of the beach, watching sun-sets, and listening to slow jazz ']
Notice that, because we specified the maxsplit keyword parameter, split() left the descriptions alone (even though they contain ‘,’ characters as well). The maxsplit parameter tells split() to perform at most that many splits, no matter how many matches are found. In our case, we told split() to split the string only on the first 2 ‘,’ characters (i.e. creating 3 fields).
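To make the effect of maxsplit concrete, here is a short sketch that splits the same line with and without it. The sample line is a shortened, made-up variant of one of the profiles:

```python
import re

# Hypothetical single-line profile whose description contains commas.
line = "Lisa Test, 1990/05/12, I like long walks, sunsets, and slow jazz"

# Without maxsplit, every comma is a split point, including those
# inside the description.
all_parts = re.split(r',', line)

# With maxsplit=2, splitting stops after the first two commas, so the
# description stays in one piece.
three_fields = re.split(r',', line, maxsplit=2)

print(len(all_parts))     # 5
print(len(three_fields))  # 3
print(three_fields[2])
```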
Our example is really progressing. We’ve now got a list of user profiles, with each user profile broken up into its separate fields. However, the data is messy: there is some stray whitespace sprinkled throughout our profiles.
Let’s clean this up:
>>> profilesList = [ list(map(str.strip, profile)) for profile in profilesList ]
>>> for profile in profilesList:
...     print(profile)
...
['Tim Fake', '1982/03/21', 'I like to eat, sleep and relax']
['Lisa Test', '1990/05/12', 'I like long walks of the beach, watching sun-sets, and listening to slow jazz']
Python’s built-in map() function takes a function and a list, and applies the function to each element of the list. In our case, it applies the string’s strip() method to each field of the profile. (In Python 3, map() returns a lazy iterator rather than a list, so wrap the call in list() if you need a list.) For more information on regular expressions, visit the official Python re docs.
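As a small illustration of map() on a single profile, the sketch below strips each field and compares the result with the equivalent list comprehension. It is written for Python 3, where list() is needed to collect the results:

```python
# One profile as it looks before cleaning (leading spaces on each field).
fields = [' Tim Fake', ' 1982/03/21', ' I like to eat, sleep and relax']

# map() applies str.strip to each field; list() collects the results,
# which is required on Python 3 because map() returns an iterator.
stripped = list(map(str.strip, fields))

# The same operation written as a list comprehension.
stripped_lc = [field.strip() for field in fields]

print(stripped)
print(stripped == stripped_lc)  # True
```

Which form to use is largely a matter of taste; the list comprehension avoids the extra list() call.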
Our example has come to an end. We have successfully structured our user profile data.
We could easily take this list and use it to instantiate User Profile objects within our system, display user profiles on a web-page, or persist profile data in a database.
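As one possible way to take that next step, the sketch below wraps the steps from this article in a single function and returns lightweight record objects. The names Profile and parse_profiles, the choice of namedtuple, and the conversion of the birth date with datetime are all assumptions made for illustration; they are not part of the walkthrough above. It is written for Python 3.

```python
import re
from collections import namedtuple
from datetime import datetime

# Hypothetical record type; the field names are illustrative only.
Profile = namedtuple('Profile', ['name', 'birth_date', 'description'])

def parse_profiles(raw_text):
    """Turn raw profile text into a list of Profile records."""
    profiles = []
    # strip() first so leading/trailing blank lines don't produce empty chunks.
    for chunk in re.split(r'\n{2,}', raw_text.strip()):
        # Join a multi-line profile into one line.
        flat = re.sub(r'\n', ' ', chunk)
        # Split on the first two commas only, then trim each field.
        # A chunk with fewer than two commas raises ValueError here.
        name, birth, description = [
            field.strip() for field in re.split(r',', flat, maxsplit=2)
        ]
        # Assumes dates always look like 1982/03/21.
        birth_date = datetime.strptime(birth, '%Y/%m/%d').date()
        profiles.append(Profile(name, birth_date, description))
    return profiles

raw_profiles = '''
Tim Fake, 1982/03/21, I like to
eat, sleep and
relax

Lisa Test, 1990/05/12, I like long
walks of the beach, watching sun-sets,
and listening to slow jazz
'''

for profile in parse_profiles(raw_profiles):
    print(profile.name, profile.birth_date.year, profile.description)
```

A namedtuple keeps the sketch short; a full class would be a reasonable alternative if the profiles need behaviour of their own.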
This is most helpful. Thanks.
How do you do it if the text is in a file, without having to read the whole file into memory? E.g. the description could be very long, the file very large, etc. Is there a way not to read chunks of the file in memory?
How do you export the result into CSV?
Thanks