Python for Web Scraping with BeautifulSoup

养花风水

Web scraping comes up in almost any discussion of data harvesting, and it has grown popular in recent years. With so much information available online, it has become a go-to technique for collecting data. Thanks to its ease of use and its large collection of libraries, Python has become one of the preferred languages for scraping websites, and BeautifulSoup is one of its most widely used scraping libraries. This article explains the concept of web scraping and shows how to use Python and BeautifulSoup to extract and process data from web pages.

How Do You Define Web Scraping

To put it simply, web scraping is the extraction of data from websites. Content on most sites is delivered as HTML, which browsers render for people to read. Because HTML is also structured markup, programs can parse it, and that is what makes web scraping possible. In practice, web scraping means collecting the data found on various pages and saving it in the desired format for analysis, research, or any other later use.

Although web scraping is remarkably useful, it should be used responsibly. Some websites explicitly prohibit scraping in their Terms of Use, so always comply with the posted rules of the site you are scraping.

Introduction to BeautifulSoup

BeautifulSoup is a Python library that makes parsing and navigating HTML documents faster and easier. It provides simple methods for moving around a page's HTML tree, which makes it easier to extract specific pieces of information. Note that BeautifulSoup does not run JavaScript, so on its own it cannot see HTML that scripts create dynamically. Much of its popularity comes from its ease of use and simple interface: a handful of lines of code is enough to start scraping. It also copes with most common problems such as broken or non-standard HTML. BeautifulSoup works with several HTML parsers, but most of the time the one built into Python is sufficient.

Setting Up The Necessary Libraries

Before starting to scrape with BeautifulSoup, you need to install the appropriate libraries. The two main libraries are `requests` and `beautifulsoup4`. The `requests` library fetches the contents of web pages, and `beautifulsoup4` parses and traverses the HTML.

To set up these libraries, you can use the command below in your terminal or command prompt:

  pip install requests beautifulsoup4

Once the installation finishes, you can start writing Python code to scrape sites.

Getting Contents of The Web

The first step of any web scraping task is fetching the page you wish to scrape, and for this you use the `requests` library. `requests.get()` issues an HTTP GET request to the given URL (uniform resource locator) and returns a response object that contains the page's HTML.

The code below shows how to make a request and get a web page:

  import requests

  url = "https://example.com"
  response = requests.get(url)
  html_content = response.text

In this example, `response.text` holds the HTML of the page. Now that you have this HTML, you can pass it to BeautifulSoup, which will parse it so that you can extract data.
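
A slightly more defensive version of the request step might look like the sketch below. The function name `fetch_html`, the 10-second timeout, and the `User-Agent` value are illustrative assumptions, not part of the original example; `raise_for_status()` and the `timeout` and `headers` parameters are standard `requests` features.

```python
import requests


def fetch_html(url, timeout=10):
    """Download a page and return its HTML text, raising on HTTP errors."""
    # Hypothetical User-Agent string; identifying your scraper is a courtesy.
    headers = {"User-Agent": "example-scraper/0.1"}
    # Without a timeout, a stalled server could block the script indefinitely.
    response = requests.get(url, headers=headers, timeout=timeout)
    # Raises requests.exceptions.HTTPError for 4xx and 5xx status codes.
    response.raise_for_status()
    return response.text
```

Calling `fetch_html("https://example.com")` would return the same string as `response.text` above, but a failed request surfaces as an exception instead of an error page that gets parsed by mistake.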

BeautifulSoup HTML Parsing

Once you have the HTML content of a page, it is not very useful as a raw string; it needs to be parsed before you can work with it. This is what BeautifulSoup is for. The `BeautifulSoup` class takes the HTML content as its argument and constructs a parse tree that you can search and traverse.

Wondering how to create a soup? Here's how:

  from bs4 import BeautifulSoup

  soup = BeautifulSoup(html_content, 'html.parser')

In this code, … tells BeautifulSoup which parser to employ. Other parsers are available, but the standard one suffices for the majority of tasks.
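
To try the parsing step without fetching anything, you can feed BeautifulSoup a string directly. The sketch below uses a made-up snippet of markup (an assumption for illustration) in which the `<p>` and `<b>` tags are deliberately left unclosed, to show that the parse tree is still usable.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup; <p> and <b> are intentionally never closed.
html_content = "<html><head><title>Demo page</title></head><body><p>Hello <b>world"

soup = BeautifulSoup(html_content, "html.parser")

# Tags can be reached as attributes of the soup object.
page_title = soup.title.string
bold_text = soup.b.get_text()

print(page_title)
print(bold_text)
# prettify() prints the repaired tree with one tag per line.
print(soup.prettify())
```

The exact way broken markup is repaired can differ between parsers, so results for badly malformed pages may vary if you switch away from `html.parser`.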

How to Move in the HTML

An HTML document is organized as a tree of elements arranged hierarchically. BeautifulSoup provides functions for navigating and searching this tree: you can search for specific tags, get the content enclosed within those tags, or read their attributes.

Searching for an Exact Tag

When you need to extract a particular HTML element, you can use the `find()` method. For example, to get the content of the first `<h1>` header on the page:

  h1_tag = soup.find('h1')
  print(h1_tag.text)

Here, `soup.find('h1')` returns the first `<h1>` element on the page, and `.text` retrieves the text inside that tag. If the page has no `<h1>`, `find()` returns `None`, so `h1_tag.text` would raise an `AttributeError`.
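
Because `find()` returns `None` when nothing matches, it can be worth guarding the lookup. The helper below is a minimal sketch; the name `first_text` and the sample markup are assumptions made for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup with an <h1> but no <h2>.
html_content = "<html><body><h1>Weekly report</h1><p>No subtitle here.</p></body></html>"
soup = BeautifulSoup(html_content, "html.parser")


def first_text(soup, tag_name, default=""):
    """Return the stripped text of the first matching tag, or default if none exists."""
    tag = soup.find(tag_name)
    if tag is None:
        return default
    return tag.get_text(strip=True)


heading = first_text(soup, "h1")
subtitle = first_text(soup, "h2", default="(no h2 found)")
print(heading)
print(subtitle)
```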

Retrieving Tags in Bulk

To widen the search, for example to find all the anchor tags (`<a>`) on the page, use the `find_all()` method. It returns every match for the given tag as a list-like collection.

  a_tags = soup.find_all('a')
  for a in a_tags:
      print(a.get('href'))

In this example, `soup.find_all('a')` collects all the anchor elements on the page, and `a.get('href')` then reads the `href` address of each individual link.
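
Links on real pages are often relative, and some anchors have no `href` at all. The sketch below shows one way to handle both cases, using `find_all('a', href=True)` to skip anchors without an `href` and `urllib.parse.urljoin` to build absolute URLs. The base URL and the markup are made-up examples.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Hypothetical page address and markup, for illustration only.
base_url = "https://example.com/blog/"
html_content = """
<a href="/about">About</a>
<a href="post-1.html">First post</a>
<a name="top">Anchor without href</a>
<a href="https://example.org/">External</a>
"""
soup = BeautifulSoup(html_content, "html.parser")

links = []
# href=True keeps only the anchors that actually have an href attribute.
for a in soup.find_all("a", href=True):
    # urljoin resolves relative paths against the page's own address.
    links.append(urljoin(base_url, a["href"]))

for link in links:
    print(link)
```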

Extracting Information inside the Tags

Many HTML tags carry additional information in their attributes that can be useful. For example, anchor tags usually have an `href` attribute that specifies the link's destination. The `get()` method retrieves an attribute's value.

For instance, to retrieve the `src` attribute of an image tag, which holds the URL of the image, you can do the following:

  img_tag = soup.find('img')
  img_url = img_tag.get('src')
  print(img_url)

This prints the `src` URL of the first image tag found on the page.
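
The difference between `get()` and square-bracket access matters when an attribute may be absent. The following sketch, built on made-up markup, compares the two and also shows the `attrs` dictionary that holds all of a tag's attributes.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the second image has no src attribute.
html_content = '<img src="/static/logo.png" alt="Logo"><img alt="No source">'
soup = BeautifulSoup(html_content, "html.parser")

first_img, second_img = soup.find_all("img")

src = first_img.get("src")        # value of the attribute
all_attrs = first_img.attrs       # every attribute as a dictionary
missing = second_img.get("src")   # None, because the attribute is absent
fallback = second_img.get("src", "placeholder.png")  # explicit default

# Square-bracket access raises KeyError instead of returning None.
try:
    second_img["src"]
    raised_key_error = False
except KeyError:
    raised_key_error = True

print(src, all_attrs, missing, fallback, raised_key_error)
```

Using `get()` is usually the safer choice when scraping pages whose markup you do not control.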

Traversing the HTML Document

Traversing the HTML document is one of BeautifulSoup's defining features. Every tag in the document exposes attributes and methods for moving through the tree structure, for example to access the parent, child, or sibling elements of a given tag.

Suppose you want to know the parent element of a given tag. You can use the following code:

  child_tag = soup.find('p')
  parent_tag = child_tag.parent
  print(parent_tag)

This prints the parent element of the first `<p>` tag on the page, including everything that parent contains.
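
Children and siblings can be reached in a similar way. The sketch below uses a small made-up fragment to show `.parent`, `.children`, `find_next_sibling()`, and `find_previous_sibling()` together. The markup contains no whitespace between tags, so every child is a tag rather than a text node.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment, written without whitespace between the tags.
html_content = "<div id='intro'><h2>Title</h2><p>First</p><p>Second</p></div>"
soup = BeautifulSoup(html_content, "html.parser")

first_p = soup.find("p")

parent_name = first_p.parent.name                  # name of the enclosing tag
parent_id = first_p.parent.get("id")               # attribute of the parent
next_p = first_p.find_next_sibling("p")            # next <p> at the same level
previous_tag = first_p.find_previous_sibling()     # tag just before this one
child_names = [child.name for child in first_p.parent.children]

print(parent_name, parent_id)
print(next_p.get_text(), previous_tag.name)
print(child_names)
```

On real pages the whitespace between tags shows up as text nodes among the children, which is why `find_next_sibling()` is often more convenient than the plain `.next_sibling` attribute.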

Looking up Content that is Dynamically Loaded

Many websites use JavaScript to load their content after the initial page arrives. BeautifulSoup cannot help here on its own, because it only parses static HTML and does not execute scripts. However, you can use a browser automation library such as Selenium or Playwright to run the JavaScript and obtain the HTML that the scripts produce.

Once you have the final HTML, you can pass it to BeautifulSoup for parsing as usual.
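
As a rough sketch of that hand-off, the code below uses Playwright's synchronous API to render a page and then passes the resulting HTML to BeautifulSoup. It assumes Playwright and a Chromium build are installed (`pip install playwright` followed by `playwright install`). The function names `render_html` and `extract_headings` are illustrative, not part of either library.

```python
from bs4 import BeautifulSoup


def render_html(url):
    """Load a page in headless Chromium and return the HTML after scripts have run."""
    # Imported here so the rest of the module works without Playwright installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto(url)
        html = page.content()  # serialized DOM, including script-generated nodes
        browser.close()
    return html


def extract_headings(html):
    """Return the text of every <h1> and <h2> in the given HTML, in document order."""
    soup = BeautifulSoup(html, "html.parser")
    return [h.get_text(strip=True) for h in soup.find_all(["h1", "h2"])]
```

With both pieces in place, `extract_headings(render_html("https://example.com"))` would parse the rendered page rather than the raw response. Pages that load data late may also need an explicit wait before `page.content()` is called.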

Conclusion

Web scraping is one of the most powerful ways to collect data from the web, and Python's BeautifulSoup library makes it easy to navigate and parse even cluttered HTML. Combined with the `requests` library, BeautifulSoup lets you fetch web pages, find the useful parts, and extract their content for other purposes. As you delve deeper into web scraping, you will come across more sophisticated techniques such as filling out and submitting forms, sending and receiving cookies, and handling content loaded through AJAX requests.