Scraping Part Of A Wikipedia Infobox
I'm using Python 2.7, requests & BeautifulSoup to scrape approximately 50 Wikipedia pages. I've created a column in my dataframe that has partial URLs that relate to the name
Solution 1:
Rather than trying to parse the HTML output, parse the raw MediaWiki source for the page; the first line that starts with | Length contains the information you are looking for:
import requests

# xyz holds the page title, e.g. 'No_One_Knows'
url = 'http://en.wikipedia.org/wiki/' + xyz
resp = requests.get(url, params={'action': 'raw'})
page = resp.text

for line in page.splitlines():
    if line.startswith('| Length'):
        length = line.partition('=')[-1].strip()
        break
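Since the question mentions a dataframe column of roughly 50 page titles, it may help to wrap the parsing step in a reusable helper that can be applied to each page's wikitext. This is a sketch only; the helper name and the abridged sample wikitext below are hypothetical, and the network fetch is omitted:

```python
def extract_infobox_field(wikitext, field):
    """Return the value of an infobox parameter (e.g. 'Length') from raw
    MediaWiki wikitext, or None if the parameter is not present."""
    for line in wikitext.splitlines():
        line = line.strip()
        # Infobox parameters look like "| Length = 4:13"
        if not line.startswith('|'):
            continue
        key, sep, value = line.lstrip('|').partition('=')
        if sep and key.strip() == field:
            return value.strip()
    return None


# Hypothetical, abridged wikitext standing in for a real page fetch
sample = """{{Infobox song
| Name = No One Knows
| Length = 4:13 <small>(Radio edit)</small><br />4:38 <small>(Album version)</small>
}}"""

print(extract_infobox_field(sample, 'Length'))
```

The helper can then be called once per title fetched with `requests.get(url, params={'action': 'raw'})`, instead of repeating the loop inline.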
Demo:
>>> import requests
>>> xyz = 'No_One_Knows'
>>> url = 'http://en.wikipedia.org/wiki/' + xyz
>>> resp = requests.get(url, params={'action': 'raw'})
>>> page = resp.text
>>> for line in page.splitlines():
...     if line.startswith('| Length'):
...         length = line.partition('=')[-1].strip()
...         break
...
>>> print length
4:13 <small>(Radio edit)</small><br />4:38 <small>(Album version)</small>
You can process this further to extract the richer data here (the Radio edit vs. the Album version timings) as required.
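One way to split that value into the two labelled timings, assuming the `<small>(…)</small>` pattern shown in the demo output holds, is a small regular expression (a sketch, not guaranteed to cover every infobox's formatting):

```python
import re

length = '4:13 <small>(Radio edit)</small><br />4:38 <small>(Album version)</small>'

# Pair each duration with the label that follows it inside the <small> tag
versions = dict(
    (label, time)
    for time, label in re.findall(r'(\d+:\d+)\s*<small>\(([^)]+)\)</small>', length)
)
print(versions)
# {'Radio edit': '4:13', 'Album version': '4:38'}
```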