Scraping Part Of A Wikipedia Infobox
I'm using Python 2.7, requests & BeautifulSoup to scrape approximately 50 Wikipedia pages. I've created a column in my dataframe that has partial URLs that relate to the name
Solution 1:
Rather than trying to parse the HTML output, parse the raw MediaWiki source for the page; the first line that starts with | Length contains the information you are looking for:
import requests

# xyz holds the page title, e.g. 'No_One_Knows'
url = 'http://en.wikipedia.org/wiki/' + xyz
resp = requests.get(url, params={'action': 'raw'})
page = resp.text

for line in page.splitlines():
    if line.startswith('| Length'):
        length = line.partition('=')[-1].strip()
        break
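Since the question mentions a dataframe column of roughly 50 page titles, it may help to wrap the parsing step in a reusable helper that can be applied to each page's wikitext. This is a sketch only; the helper name and the abridged sample wikitext below are hypothetical, and the network fetch is omitted:

```python
def extract_infobox_field(wikitext, field):
    """Return the value of an infobox parameter (e.g. 'Length') from raw
    MediaWiki wikitext, or None if the parameter is not present."""
    for line in wikitext.splitlines():
        line = line.strip()
        # Infobox parameters look like "| Length = 4:13"
        if not line.startswith('|'):
            continue
        key, sep, value = line.lstrip('|').partition('=')
        if sep and key.strip() == field:
            return value.strip()
    return None


# Hypothetical, abridged wikitext standing in for a real page fetch
sample = """{{Infobox song
| Name = No One Knows
| Length = 4:13 <small>(Radio edit)</small><br />4:38 <small>(Album version)</small>
}}"""

print(extract_infobox_field(sample, 'Length'))
```

The helper can then be called once per title fetched with `requests.get(url, params={'action': 'raw'})`, instead of repeating the loop inline.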
Demo:
>>> import requests
>>> xyz = 'No_One_Knows'
>>> url = 'http://en.wikipedia.org/wiki/' + xyz
>>> resp = requests.get(url, params={'action': 'raw'})
>>> page = resp.text
>>> for line in page.splitlines():
...     if line.startswith('| Length'):
...         length = line.partition('=')[-1].strip()
...         break
...
>>> print length
4:13 <small>(Radio edit)</small><br />4:38 <small>(Album version)</small>
You can process this further to extract the richer data here (the Radio edit vs. the Album version timings) as required.
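One way to split that value into the two labelled timings, assuming the `<small>(…)</small>` pattern shown in the demo output holds, is a small regular expression (a sketch, not guaranteed to cover every infobox's formatting):

```python
import re

length = '4:13 <small>(Radio edit)</small><br />4:38 <small>(Album version)</small>'

# Pair each duration with the label that follows it inside the <small> tag
versions = dict(
    (label, time)
    for time, label in re.findall(r'(\d+:\d+)\s*<small>\(([^)]+)\)</small>', length)
)
print(versions)
# {'Radio edit': '4:13', 'Album version': '4:38'}
```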