Web Scraping with Python and Beautiful Soup


Introduction

Web scraping is the process of extracting data from websites. It’s a valuable skill for gathering data for analysis or machine learning, or for tasks like aggregating news articles and monitoring prices. Python, with its powerful libraries, makes web scraping straightforward and efficient. In this blog, we’ll be using Beautiful Soup, a library for parsing HTML and XML documents.

What is Beautiful Soup?

Beautiful Soup is a Python library that makes it easy to scrape information from web pages. It creates parse trees from page source code that can be used to extract data easily. Beautiful Soup works with your parser to provide ways of navigating, searching, and modifying the parse tree.
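
For example, a minimal snippet (using a hard-coded HTML string purely for illustration) shows how Beautiful Soup turns markup into a searchable parse tree:

    from bs4 import BeautifulSoup

    html = '<html><body><h1>Hello</h1><a href="/about">About</a></body></html>'
    soup = BeautifulSoup(html, 'html.parser')

    print(soup.h1.text)    # Hello
    print(soup.a['href'])  # /about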

Prerequisites

Ensure you have the following installed:

  • Python
  • Beautiful Soup
  • Requests
  • urllib (part of the Python standard library, so no separate installation is needed)

You can install Beautiful Soup and Requests using pip:

pip install beautifulsoup4 requests

Step-by-Step Guide to Web Scraping

  1. Import Necessary Libraries

    import re
    import time
    import requests
    from bs4 import BeautifulSoup
    from urllib.parse import urljoin, urlparse

This script uses the Requests library to fetch web pages and Beautiful Soup to parse HTML content. It employs urljoin and urlparse from the urllib.parse module to handle URL manipulation, resolving relative links to full URLs and checking that links belong to the same domain. The re module is used to verify that links use HTTPS, and time supports an optional politeness delay between requests.
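
As a quick illustration (with example.com as a placeholder domain), urljoin resolves a relative link against a base URL, and urlparse splits a URL into components such as its domain:

    from urllib.parse import urljoin, urlparse

    base = 'https://example.com/blog/'
    print(urljoin(base, '../about'))                          # https://example.com/about
    print(urlparse('https://example.com/blog/post').netloc)   # example.com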

  2. Fetch the Web Page and Handle Pagination

    def scrape_all_links(start_url, max_depth=3):
        def extract_links(url, base_url):
            response = requests.get(url)
            if response.status_code == 200:
                soup = BeautifulSoup(response.content, 'html.parser')
                return [urljoin(base_url, a['href']) for a in soup.find_all('a', href=True)]
            return []

        def is_valid_url(url, base_url):
            # Check if the URL belongs to the same domain as the base URL
            base_domain = urlparse(base_url).netloc
            url_domain = urlparse(url).netloc
            return base_domain == url_domain

        def is_https_link(link):
            # Check if the link starts with https
            return re.match(r'^https://', link) is not None

        base_url = start_url
        visited = set()
        to_visit = [(start_url, 0)]
        all_links = set()

        while to_visit:
            current_url, depth = to_visit.pop(0)
            if current_url not in visited and depth <= max_depth:
                visited.add(current_url)
                links = extract_links(current_url, base_url)
                all_links.update(filter(is_https_link, links))
                for link in links:
                    if is_valid_url(link, base_url) and link not in visited:
                        to_visit.append((link, depth + 1))

                # Add a delay to avoid overwhelming the server
                # time.sleep(1)

        # Define a list of unwanted substrings to exclude
        unwanted_substrings = ["youtube", "linkedin", ".jpeg", ".jpg", ".png"]

        # Filter out links containing any unwanted substring
        final_links = set()
        for link in all_links:
            if not any(substring in link for substring in unwanted_substrings):
                final_links.add(link)
        return list(final_links)

This function demonstrates how to scrape HTTPS links from a website using the Requests and Beautiful Soup libraries. It starts from a given URL, extracts the links on each page, keeps only those that belong to the same domain, and filters out unwanted URLs and file types. By traversing links breadth-first up to a specified depth, it can follow paginated listings across multiple pages. An optional delay (commented out above) can be enabled to avoid overloading the server, making this a practical tool for gathering data from many pages at once.

  3. Parse and Extract Data

    def extract_content(url):
        response = requests.get(url)
        if response.status_code == 200:
            soup = BeautifulSoup(response.content, 'html.parser')
            # Get the text content of the page
            return soup.get_text()
        return ''

The extract_content function retrieves and parses the HTML content of a specified URL using the Requests and Beautiful Soup libraries. It sends a GET request to the given URL and checks if the response is successful. Then, it uses Beautiful Soup to parse the HTML content of the response. Finally, the function extracts and returns all the text content from the parsed HTML, which includes all the visible text on the web page. This is useful for gathering raw text data for further processing or analysis.
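
If you need tidier output, get_text accepts optional separator and strip arguments; here is a small sketch, again with a hard-coded HTML string for illustration:

    from bs4 import BeautifulSoup

    html = '<p>  First paragraph </p><p>Second paragraph</p>'
    soup = BeautifulSoup(html, 'html.parser')

    # Join text fragments with a space and trim surrounding whitespace
    print(soup.get_text(separator=' ', strip=True))
    # First paragraph Second paragraph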

  4. Main Function

    if __name__ == "__main__":
        home_url = 'https://example.com/'
        max_depth = 3  # Set the depth you want to scrape
        all_links = scrape_all_links(home_url, max_depth)
        for link in all_links:
            content = extract_content(link)
            # Process or store the page text here

This code sets the starting URL to ‘https://example.com/’ and specifies a maximum depth of 3 for web scraping. It calls the scrape_all_links function to collect all HTTPS links from the specified URL up to the given depth. Then, for each extracted link, it calls the extract_content function to retrieve and parse the text content from each linked web page. This process allows you to gather textual data from multiple pages within the same domain, facilitating comprehensive data extraction for further analysis or processing.
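
If you want to persist each page’s text for later analysis, one minimal sketch is to write it to disk; the output directory and file-naming scheme below are arbitrary choices for illustration, not part of the original script:

    import os
    from urllib.parse import urlparse

    def save_content(link, content, out_dir='scraped_pages'):
        # Derive a simple file name from the URL path (hypothetical naming scheme)
        os.makedirs(out_dir, exist_ok=True)
        name = urlparse(link).path.strip('/').replace('/', '_') or 'index'
        with open(os.path.join(out_dir, name + '.txt'), 'w', encoding='utf-8') as f:
            f.write(content)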

Best Practices

  1. Use a Delay Between Requests: To avoid overloading the server, include a delay between requests.
  2. Handle Exceptions: Network requests can fail, so include error handling in your script (see the sketch below for both).
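
A minimal fetch helper combining both practices might look like this; the one-second delay and ten-second timeout are arbitrary defaults you should tune for the site you are scraping:

    import time
    import requests

    def polite_get(url, delay=1):
        # Pause before each request to avoid overwhelming the server
        time.sleep(delay)
        try:
            response = requests.get(url, timeout=10)
            response.raise_for_status()  # Raise an error for 4xx/5xx responses
            return response
        except requests.exceptions.RequestException as e:
            print(f"Request failed for {url}: {e}")
            return None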

Conclusion

In conclusion, web scraping with Python using libraries like Beautiful Soup and Requests is a powerful way to gather information from websites. By setting a starting URL and depth, you can efficiently extract links and content from multiple pages within the same domain. This method allows you to automate the data collection process, making it easier to gather and analyze large amounts of web data for various applications. Whether for research, data analysis, or monitoring websites, web scraping is a valuable skill to have.


Gajalakshmi N
