Web scraping is the process of extracting data from websites. It’s a valuable skill for gathering information for data analysis, machine learning, or tasks like aggregating news articles and monitoring prices. Python, with its powerful libraries, makes web scraping straightforward and efficient. In this blog, we’ll be using Beautiful Soup, a library for parsing HTML and XML documents.
Beautiful Soup is a Python library that makes it easy to scrape information from web pages. It builds a parse tree from a page’s source code, which you can then navigate and search to extract the data you need. Beautiful Soup works with a parser of your choice (such as Python’s built-in html.parser) to provide ways of navigating, searching, and modifying the parse tree.
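For example, here is a tiny, self-contained sketch of what that looks like; the HTML snippet below is made up purely for illustration:

from bs4 import BeautifulSoup

# A small, made-up HTML snippet used only to show the idea
html = '<html><body><h1>Hello</h1><a href="/about">About us</a></body></html>'

soup = BeautifulSoup(html, 'html.parser')
print(soup.h1.text)      # Hello
print(soup.a['href'])    # /about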
Ensure you have Python 3 and pip installed on your system.
You can install Beautiful Soup and Requests using pip (urljoin and urlparse come from urllib.parse, which is part of Python’s standard library, so nothing extra is needed for them):

pip install beautifulsoup4 requests
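To confirm the installation worked, you can optionally check that both packages import cleanly and print their versions:

python -c "import bs4, requests; print(bs4.__version__, requests.__version__)"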
import re                     # used to check that links use HTTPS
import time                   # optional: for adding a polite delay between requests
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin, urlparse
This script uses the Requests library to fetch web pages and Beautiful Soup to parse HTML content. It also employs urljoin and urlparse from the urllib.parse module to handle URL manipulation, ensuring relative links are correctly resolved to full URLs and validating that links belong to the same domain. This combination allows for efficient and structured web scraping.
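As a quick illustration of what these two helpers do (the URLs below are just placeholders):

from urllib.parse import urljoin, urlparse

# urljoin resolves a relative link against the page it was found on
print(urljoin('https://example.com/blog/', '../about'))    # https://example.com/about

# urlparse splits a URL into components, e.g. to compare domains
print(urlparse('https://example.com/about').netloc)        # example.com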
def scrape_all_links(start_url, max_depth=3):
    def extract_links(url, base_url):
        # Fetch the page and collect every <a href="..."> link, resolved to a full URL
        response = requests.get(url)
        if response.status_code == 200:
            soup = BeautifulSoup(response.content, 'html.parser')
            links = [urljoin(base_url, a['href']) for a in soup.find_all('a', href=True)]
            return links
        return []

    def is_valid_url(url, base_url):
        # Check if the URL belongs to the same domain as the base URL
        base_domain = urlparse(base_url).netloc
        url_domain = urlparse(url).netloc
        return base_domain == url_domain

    def is_https_link(link):
        # Check if the link starts with https
        return re.match(r'^https://', link) is not None

    base_url = start_url
    visited = set()
    to_visit = [(start_url, 0)]
    all_links = set()

    while to_visit:
        current_url, depth = to_visit.pop(0)
        if current_url not in visited and depth <= max_depth:
            visited.add(current_url)
            links = extract_links(current_url, base_url)
            filtered_links = filter(is_https_link, links)
            all_links.update(filtered_links)
            for link in links:
                if is_valid_url(link, base_url) and link not in visited:
                    to_visit.append((link, depth + 1))
            # Adding delay to avoid overwhelming the server
            # time.sleep(1)

    # Define a list of unwanted substrings
    unwanted_substrings = ["youtube", "linkedin", ".jpeg", ".jpg", ".png"]

    # Filter out unwanted links
    final_links = set()
    for link in all_links:
        if not any(substring in link for substring in unwanted_substrings):
            final_links.add(link)

    return list(final_links)
This script demonstrates how to scrape HTTPS links from a website using Python’s Beautiful Soup and Requests libraries. Starting from a given URL, it extracts links, keeps only those that belong to the same domain, and filters out unwanted URLs and file types. Because it follows links up to a specified depth, it can also pick up paginated listings spread across multiple pages. The commented-out time.sleep(1) call marks where you can add a delay between requests to avoid overwhelming the server, keeping the scraper polite while it gathers data.
def extract_content(url):
    # Fetch the page and return its visible text content
    response = requests.get(url)
    if response.status_code == 200:
        soup = BeautifulSoup(response.content, 'html.parser')
        # Get the text content of the page
        return soup.get_text()
    return ""
The extract_content function retrieves and parses the HTML content of a specified URL using the Requests and Beautiful Soup libraries. It sends a GET request to the given URL and checks if the response is successful. Then, it uses Beautiful Soup to parse the HTML content of the response. Finally, the function extracts and returns all the text content from the parsed HTML, which includes all the visible text on the web page. This is useful for gathering raw text data for further processing or analysis.
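For instance, you could call it on a single page and preview the beginning of the text it returns (the URL here is just a placeholder):

text = extract_content('https://example.com/')
print(text[:300])   # print the first 300 characters of the page's text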
if __name__ == "__main__":
    home_url = 'https://example.com/'
    max_depth = 3  # Set the depth you want to scrape

    all_links = scrape_all_links(home_url, max_depth)

    for link in all_links:
        content = extract_content(link)
This code sets the starting URL to ‘https://example.com/’ and specifies a maximum depth of 3 for web scraping. It calls the scrape_all_links function to collect all HTTPS links from the specified URL up to the given depth. Then, for each extracted link, it calls the extract_content function to retrieve and parse the text content from each linked web page. This process allows you to gather textual data from multiple pages within the same domain, facilitating comprehensive data extraction for further analysis or processing.
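In practice you will usually want to keep the extracted text rather than discard it. Here is one minimal way to do that, collecting each page’s text into a dictionary keyed by its URL; the URL and depth below are only illustrative:

page_texts = {}
for link in scrape_all_links('https://example.com/', max_depth=2):
    # Store each page's text under its URL for later analysis
    page_texts[link] = extract_content(link)

print(f"Scraped {len(page_texts)} pages")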
In conclusion, web scraping with Python using libraries like Beautiful Soup and Requests is a powerful way to gather information from websites. By setting a starting URL and depth, you can efficiently extract links and content from multiple pages within the same domain. This method allows you to automate the data collection process, making it easier to gather and analyze large amounts of web data for various applications. Whether for research, data analysis, or monitoring websites, web scraping is a valuable skill to have.
Gajalakshmi N