AI Web Scraper
This AI-powered web scraper lets you fetch data from any website, clean it up, and extract specific information using a large language model (LLM). Inspired by a project from TechWithTim, it adds improvements and customizations that streamline fetching and processing data.
Project Overview
The Scraper:
- Takes a website URL as input.
- Extracts and cleans the page content.
- Allows you to specify the type of information to extract from the data.
- Uses an LLM to analyze the content and return only the relevant information (sketched below).
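Under the hood, the pipeline chains the helper functions shown under Code Snippets below. A minimal sketch of the flow (split_dom_content is an illustrative chunking helper, sketched at the end of that section):

html = scrape_site("https://example.com")            # 1. fetch the rendered page
body = extract_content(html)                         # 2. isolate the <body>
text = clean_content(body)                           # 3. strip scripts/styles, tidy the text
result = parse_with_ollama(split_dom_content(text),  # 4. let the LLM return only
                           "Get all product names.") #    what you asked for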
Installation
To install the necessary dependencies:
- Create a virtual environment:
python -m venv /path/to/new/virtual/environment
- Activate it (on macOS/Linux: source /path/to/new/virtual/environment/bin/activate; on Windows, run the Scripts\activate script).
- Install the dependencies:
pip install -r requirements.txt
The requirements include:
  - streamlit: For creating a web interface.
  - selenium: For automated web scraping.
  - beautifulsoup4, lxml, html5lib: For parsing and cleaning HTML content.
  - langchain and langchain_ollama: For interfacing with the LLM.
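For reference, a requirements.txt covering the packages above might look like this (unpinned here; pin versions as your setup requires):

streamlit
selenium
beautifulsoup4
lxml
html5lib
langchain
langchain_ollama

Note that the parsing step also needs a local Ollama installation with the llama3 model pulled (ollama pull llama3).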
How to Use the Application
Step 1: Launch the App
Run the following command:
streamlit run main.py
Step 2: Input the URL to Scrape
Enter the URL of the site you want to scrape in the provided input box, then click “Scrape Website”. The app will display the raw page content in an expandable section.
Step 3: Specify the Data to Extract
In the Parse Content section, describe the data you want extracted. For example, you could type “Get all the headings and product names.”
Step 4: Parse the Content
Click “Parse Content” to use the LLM to extract and display only the information matching your description.
Code Snippets
Here are a few core functions from the project:
Web Scraping with Selenium
The “scrape_site” function loads a website and retrieves its HTML content:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

def scrape_site(website):
    # Run Chrome headless so no browser window opens
    chrome_options = Options()
    chrome_options.add_argument("--headless")
    driver = webdriver.Chrome(options=chrome_options)
    try:
        driver.get(website)        # load the page
        html = driver.page_source  # grab the rendered HTML
    finally:
        driver.quit()              # always release the browser
    return html
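Note that webdriver.Chrome needs a matching ChromeDriver. Selenium 4.6+ resolves one automatically via Selenium Manager; on older versions you can point to a driver binary yourself, for example (the path below is illustrative):

from selenium.webdriver.chrome.service import Service

driver = webdriver.Chrome(service=Service("/usr/local/bin/chromedriver"), options=chrome_options)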
Parsing with BeautifulSoup
After scraping, extract_content and clean_content prepare the HTML content:
from bs4 import BeautifulSoup

def extract_content(html_content):
    # Parse the full page and keep only the <body> element
    soup = BeautifulSoup(html_content, "html.parser")
    return str(soup.body)

def clean_content(body_content):
    soup = BeautifulSoup(body_content, "html.parser")
    # Drop script and style tags, which carry no readable text
    for tag in soup(["script", "style"]):
        tag.extract()
    # Collapse the remaining text into non-empty, stripped lines
    return "\n".join(
        line.strip() for line in soup.get_text().splitlines() if line.strip()
    )
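A quick illustration of the two functions on a small snippet:

raw = """<html><body>
<h1>Products</h1>
<script>var x = 1;</script>
<p>  Widget  </p>
</body></html>"""

print(clean_content(extract_content(raw)))
# Output:
# Products
# Widget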
Parsing with the LLM
Using the langchain_ollama library together with a LangChain prompt template, parse_with_ollama extracts the requested data:
from langchain_core.prompts import ChatPromptTemplate
from langchain_ollama import OllamaLLM

model = OllamaLLM(model="llama3")
prompt = ChatPromptTemplate.from_template(
    "Extract only the information matching this description: {parse_description}\n\n{dom_content}"
)
chain = prompt | model  # pipe the filled-in prompt into the model

def parse_with_ollama(dom_chunks, parse_description):
    parsed_results = []
    for chunk in dom_chunks:
        # Run each chunk of cleaned page text through the chain
        response = chain.invoke({"dom_content": chunk, "parse_description": parse_description})
        parsed_results.append(response)
    return "\n".join(parsed_results)
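Because the model's context window is limited, the cleaned text is fed in as chunks. A minimal chunking helper (split_dom_content is an illustrative name, matching the sketch in the Project Overview):

def split_dom_content(dom_content, max_length=6000):
    # Slice the cleaned text into fixed-size pieces the model can handle
    return [dom_content[i:i + max_length]
            for i in range(0, len(dom_content), max_length)]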
Credits
This project was inspired by TechWithTim. Special thanks for the foundational concepts, which have been adapted and expanded to meet additional requirements.