Beautiful Soup Web Scraping Example



The Python libraries requests and Beautiful Soup are powerful tools for web scraping, a technique for extracting data from websites. If you like to learn with hands-on examples and you have a basic understanding of Python and HTML, then this tutorial is for you.

Beautiful Soup is a library (a set of pre-written code) that gives us methods to extract data from HTML. Just as OpenCV is commonly used for image processing and computer vision, TensorFlow for machine learning, and Matplotlib for plotting graphs, Beautiful Soup is one of the most commonly used libraries for web scraping. The library does not download pages from the internet itself; it parses HTML that you have already retrieved.

Some sites require a login. When you log in to a site in the normal way, you identify yourself with your credentials, and that identity is stored in cookies and headers and reused for every later interaction for a brief period of time. This makes a semi-automated approach possible without Selenium, mechanize, or other third-party tools.

Python offers a lot of powerful and easy to use tools for scraping websites. One of Python's useful modules to scrape websites is known as Beautiful Soup.

In this tutorial we'll build a small Beautiful Soup program known as a 'web scraper'. It will get data from a Yahoo Finance page about stock options. It's all right if you don't know anything about stock options; the important thing is that the page has a table of information that we'd like to use in our program. The page is a listing for Apple Computer stock options.

First we need to get the HTML source for the page. Beautiful Soup won't download the content for us, but we can do that with Python's urllib module, one of the libraries that comes standard with Python.

Fetching the Yahoo Finance Page

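# Python 3; in Python 2 the same function lived in urllib2
from urllib.request import urlopen
optionsUrl = 'http://finance.yahoo.com/q/op?s=AAPL+Options'
optionsPage = urlopen(optionsUrl)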


This code retrieves the Yahoo Finance HTML and returns a file-like object.

If you go to the page we opened with Python and use your browser's 'view source' command, you'll see that it's a large, complicated HTML file. It will be Python's job to simplify it and extract the useful data using the BeautifulSoup module. BeautifulSoup is an external module, so you'll have to install it. If you haven't installed BeautifulSoup already, you can get it with pip, where the package is named beautifulsoup4.

Beautiful Soup Example: Loading a Page

The following code will load the page into BeautifulSoup:

from bs4 import BeautifulSoup
soup = BeautifulSoup(optionsPage, 'html.parser')
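
To try the loading step without a network connection, the same call can be pointed at a string. This is only a sketch: sample_html is a made-up stand-in for the downloaded page, not the real Yahoo Finance markup.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the downloaded page, so the example runs offline.
sample_html = """
<html><head><title>AAPL Options</title></head>
<body><table>
<tr><td><a href="/q?s=AAPL130328C00350000">AAPL130328C00350000</a></td><td>1.25</td></tr>
</table></body></html>
"""

# Naming the parser explicitly keeps results consistent across machines.
soup = BeautifulSoup(sample_html, "html.parser")

# The parsed document can be navigated by tag name.
page_title = soup.title.string
first_link = soup.find("a")["href"]
print(page_title)
print(first_link)
```

BeautifulSoup accepts either a string or a file-like object, so the object returned by urlopen can be passed in the same way.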

Beautiful Soup Example: Searching

Now we can start trying to extract information from the page source (HTML). We can see that the options have distinctive names in the 'symbol' column, something like AAPL130328C00350000. The symbols might be slightly different by the time you read this, but we can still use BeautifulSoup to search the document for one of these unique strings.

Let's search the soup variable for this particular option (you may have to substitute a different symbol, just get one from the webpage):

>>> soup.findAll(text='AAPL130328C00350000')
[u'AAPL130328C00350000']

This result isn't very useful yet. It's just a unicode string (that's what the 'u' means; Python 3 omits the prefix) of what we searched for. However, BeautifulSoup returns things in a tree format, so we can find the context in which this text occurs by asking for its parent node, like so:

>>> soup.findAll(text='AAPL130328C00350000')[0].parent
<a href='/q?s=AAPL130328C00350000'>AAPL130328C00350000</a>

We don't see all the information from the table. Let's try the next level higher.

>>> soup.findAll(text='AAPL130328C00350000')[0].parent.parent
<td><a href='/q?s=AAPL130328C00350000'>AAPL130328C00350000</a></td>

And again.

>>> soup.findAll(text='AAPL130328C00350000')[0].parent.parent.parent
<tr><td nowrap='nowrap'><a href='/q/op?s=AAPL&amp;k=110.000000'><strong>110.00</strong></a></td><td><a href='/q?s=AAPL130328C00350000'>AAPL130328C00350000</a></td><td align='right'><b>1.25</b></td><td align='right'><span id='yfs_c63_AAPL130328C00350000'><b style='color:#000000;'>0.00</b></span></td><td align='right'>0.90</td><td align='right'>1.05</td><td align='right'>10</td><td align='right'>10</td></tr>

Bingo. It's still a little messy, but you can see that all of the data we need is there. If you ignore all the stuff in angle brackets, you can see that this is just the data from one row.
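
The climb from the matched text up to its table row can be sketched on a small hard-coded table. The markup below is a simplified assumption modelled on the row shown above. It uses find_all-style names and the string argument, which are the current spellings in the bs4 package; findAll and text are the older names for the same thing.

```python
from bs4 import BeautifulSoup

# Simplified, hypothetical copy of one table row (not the live page).
sample_html = (
    "<table><tr>"
    "<td nowrap='nowrap'><a href='/q/op?s=AAPL&amp;k=110.000000'><strong>110.00</strong></a></td>"
    "<td><a href='/q?s=AAPL130328C00350000'>AAPL130328C00350000</a></td>"
    "<td align='right'><b>1.25</b></td>"
    "<td align='right'>0.90</td>"
    "</tr></table>"
)
soup = BeautifulSoup(sample_html, "html.parser")

symbol = "AAPL130328C00350000"
match = soup.find(string=symbol)  # the text node itself
link = match.parent               # the <a> tag around the text
cell = link.parent                # the <td> around the link
row = cell.parent                 # the <tr> holding the whole row

# find_parent walks upward until it reaches a tag with the given name,
# so the number of .parent hops does not have to be counted by hand.
row_via_find_parent = match.find_parent("tr")

cells = [td.get_text() for td in row.find_all("td")]
print(cells)
```

Using find_parent is a little more tolerant of markup changes than a fixed chain of .parent calls, because extra wrapper tags between the text and the row no longer matter.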

[
    [x.text for x in y.parent.contents]
    for y in soup.findAll('td', attrs={'class': 'yfnc_h', 'nowrap': 'nowrap'})
]

This code is a little dense, so let's take it apart piece by piece. It is a list comprehension nested within a list comprehension. Let's look at the outer loop first:

for y in soup.findAll('td', attrs={'class': 'yfnc_h', 'nowrap': 'nowrap'})

This uses BeautifulSoup's findAll method to get all of the HTML elements with a td tag, a class of yfnc_h, and a nowrap attribute of nowrap. We chose this combination because it matches exactly one element in every table entry.

If we had just gotten td's with the class yfnc_h we would have gotten seven elements per table entry. Another thing to note is that we have to wrap the attributes in a dictionary because class is one of Python's reserved words. From the table above it would return this:

<td nowrap='nowrap'><a href='/q/op?s=AAPL&amp;k=110.000000'><strong>110.00</strong></a></td>
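
The attribute filter can be tried on a tiny hand-written table. This is a sketch under assumptions: the markup is invented for illustration, and class_ (with a trailing underscore) is the keyword that the bs4 package offers as an alternative to the attrs dictionary.

```python
from bs4 import BeautifulSoup

# Hypothetical two-cell row; only the first cell carries nowrap.
sample_html = (
    "<table><tr>"
    "<td class='yfnc_h' nowrap='nowrap'>110.00</td>"
    "<td class='yfnc_h'>AAPL130328C00350000</td>"
    "</tr></table>"
)
soup = BeautifulSoup(sample_html, "html.parser")

# 'class' is a reserved word, so it cannot be passed as a plain keyword.
with_dict = soup.find_all("td", attrs={"class": "yfnc_h", "nowrap": "nowrap"})

# bs4 also accepts class_ as a keyword for the same filter.
with_keyword = soup.find_all("td", class_="yfnc_h", nowrap="nowrap")

# Filtering on the class alone matches every cell in the row.
class_only = soup.find_all("td", class_="yfnc_h")

print(len(with_dict), len(with_keyword), len(class_only))
```

Both spellings select the same single cell, while the class-only filter returns every cell in the row.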

We need to go one level higher and then get the text from all of the child nodes of this node's parent. That's what the inner list comprehension does:

[x.text for x in y.parent.contents]
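
Putting the two comprehensions together on a small hard-coded table gives one list per row. The markup and the name options_table are assumptions for illustration. The sketch also asks the parent for its td children with find_all instead of reading .contents, so stray whitespace between cells is skipped.

```python
from bs4 import BeautifulSoup

# Hypothetical two-row table using the class and attribute names from the text.
sample_html = (
    "<table>"
    "<tr><td class='yfnc_h' nowrap='nowrap'>110.00</td>"
    "<td class='yfnc_h'>AAPL130328C00350000</td><td class='yfnc_h'>1.25</td></tr>"
    "<tr><td class='yfnc_h' nowrap='nowrap'>115.00</td>"
    "<td class='yfnc_h'>AAPL130328C00355000</td><td class='yfnc_h'>0.80</td></tr>"
    "</table>"
)
soup = BeautifulSoup(sample_html, "html.parser")

options_table = [
    # Inner comprehension: the text of every cell in the same row.
    [cell.get_text() for cell in first_cell.parent.find_all("td")]
    # Outer loop: one distinctive cell per row.
    for first_cell in soup.find_all("td", attrs={"class": "yfnc_h", "nowrap": "nowrap"})
]

for row in options_table:
    print(row)
```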

This works, but you should be careful if this is code you plan to reuse frequently. If Yahoo changed the way they format their HTML, this could stop working. If you plan to use code like this in an automated way, it would be best to wrap it in a try/except block and validate the output.
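
One way to add that safety net is sketched below. The function name extract_rows, the constant EXPECTED_COLUMNS, and the sample markup are all hypothetical; a real page would need its own checks.

```python
from bs4 import BeautifulSoup

EXPECTED_COLUMNS = 3  # assumption: matches the sample table below


def extract_rows(html):
    """Return the table rows as lists of strings, or raise ValueError."""
    soup = BeautifulSoup(html, "html.parser")
    anchors = soup.find_all("td", attrs={"class": "yfnc_h", "nowrap": "nowrap"})
    if not anchors:
        raise ValueError("no option rows found; the page layout may have changed")
    rows = []
    for first_cell in anchors:
        row = [cell.get_text(strip=True) for cell in first_cell.parent.find_all("td")]
        if len(row) != EXPECTED_COLUMNS:
            raise ValueError("unexpected number of columns: %d" % len(row))
        rows.append(row)
    return rows


good_html = (
    "<table><tr><td class='yfnc_h' nowrap='nowrap'>110.00</td>"
    "<td class='yfnc_h'>AAPL130328C00350000</td><td class='yfnc_h'>1.25</td></tr></table>"
)

try:
    rows = extract_rows(good_html)
except ValueError as error:
    rows = []
    print("Scrape failed:", error)

# A page without the expected cells is reported instead of silently returning nothing.
try:
    extract_rows("<html><body><p>Redesigned page</p></body></html>")
    changed_layout_detected = False
except ValueError:
    changed_layout_detected = True

print(rows, changed_layout_detected)
```

Raising an error on an empty or misshapen result makes a layout change visible right away, rather than letting an automated job store empty data.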

This is only a simple Beautiful Soup example, and gives you an idea of what you can do with HTML and XML parsing in Python. You can find the Beautiful Soup documentation here. You'll find a lot more tools for searching and validating HTML documents.

APIs are not always available. Sometimes you have to scrape data from a webpage yourself. Luckily, the modules Pandas and Beautiful Soup can help!

Web scraping

Pandas has a neat concept known as a DataFrame. A DataFrame can hold data and be easily manipulated. We can combine Pandas with Beautifulsoup to quickly get data from a webpage.

If you find a table on the web like this:

We can convert it to JSON with:
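
A minimal sketch of that conversion, assuming a small hard-coded table in place of a real page. The column names and values are invented for illustration. DataFrame.to_json with orient="records" produces one JSON object per row.

```python
import json

import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical table standing in for one found on the web.
sample_html = """
<table>
<tr><th>Strike</th><th>Symbol</th><th>Last</th></tr>
<tr><td>110.00</td><td>AAPL130328C00350000</td><td>1.25</td></tr>
<tr><td>115.00</td><td>AAPL130328C00355000</td><td>0.80</td></tr>
</table>
"""
soup = BeautifulSoup(sample_html, "html.parser")
table = soup.find("table")

# Header cells become column names; rows without <td> cells are skipped.
headers = [th.get_text(strip=True) for th in table.find_all("th")]
rows = [
    [td.get_text(strip=True) for td in tr.find_all("td")]
    for tr in table.find_all("tr")
    if tr.find("td")
]

frame = pd.DataFrame(rows, columns=headers)
table_json = frame.to_json(orient="records")

# Parse the JSON back to confirm it is well formed.
records = json.loads(table_json)
print(table_json)
```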

And in a browser get the beautiful JSON output:

Converting to lists

Rows can be converted to Python lists.
We can convert it to a dataframe using just a few lines:
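
Both steps can be sketched on a hard-coded table. The markup and column names are assumptions for illustration. Every scraped cell starts out as a string, so the sketch converts one column to numbers with pd.to_numeric before using it.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical table standing in for one found on the web.
sample_html = """
<table>
<tr><th>Strike</th><th>Symbol</th><th>Last</th></tr>
<tr><td>110.00</td><td>AAPL130328C00350000</td><td>1.25</td></tr>
<tr><td>115.00</td><td>AAPL130328C00355000</td><td>0.80</td></tr>
</table>
"""
soup = BeautifulSoup(sample_html, "html.parser")

# Each <tr> becomes a plain Python list of cell strings.
all_rows = [
    [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
    for tr in soup.find_all("tr")
]
header, data_rows = all_rows[0], all_rows[1:]

# The lists become a DataFrame in one call.
frame = pd.DataFrame(data_rows, columns=header)

# Scraped values are text; convert the price column before doing arithmetic.
frame["Last"] = pd.to_numeric(frame["Last"])
total_last = frame["Last"].sum()

# And a DataFrame can be turned back into a list of lists.
rows_as_lists = frame.values.tolist()
print(frame)
```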

Pretty print pandas dataframe

You can convert it to an ascii table with the module tabulate.
This code will instantly convert the table on the web to an ascii table:
This will show in the terminal as: