AI Web Scraper

This AI-powered web scraper lets you fetch data from any website, clean it up, and parse specific information using a large language model (LLM). Inspired by a project from TechWithTim, it adds improvements and customizations that streamline obtaining and processing data.

Project Overview

The Scraper:

  1. Takes a website URL as input.
  2. Extracts and cleans the page content.
  3. Allows you to specify the type of information to extract from the data.
  4. Uses an LLM to analyze and parse only the relevant information.
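
Put together, the pipeline is a straight chain through the functions shown in the Code Snippets section below (a minimal sketch; the URL and description are just examples):

html = scrape_site("https://example.com")
text = clean_content(extract_content(html))
result = parse_with_ollama([text], "Get all the headings and product names.")
print(result)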

Installation

To install the necessary dependencies:

  1. Create a virtual environment and activate it (the activate command below is for macOS/Linux; on Windows, run the Scripts\activate script instead):
python -m venv /path/to/new/virtual/environment
source /path/to/new/virtual/environment/bin/activate
  2. Install the dependencies:
    • streamlit: For creating a web interface.
    • selenium: For automated web scraping.
    • beautifulsoup4, lxml, html5lib: For parsing and cleaning HTML content.
    • langchain and langchain_ollama: For interfacing with the LLM.
pip install -r requirements.txt
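
The requirements.txt file should already be included in the repo; if you need to recreate it, the dependency list above maps to these PyPI packages (unpinned here; the actual file may pin versions):

streamlit
selenium
beautifulsoup4
lxml
html5lib
langchain
langchain-ollama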

How to Use the Application

Step 1: Launch the app

Use the following command:

streamlit run main.py

Step 2: Input the URL to Scrape

Enter the URL of the site you want to scrape in the provided input box, then click “Scrape Website”. The app will display the raw page content in an expandable section.

Step 3: Specify the Data to Extract

In the Parse Content section, describe the data you want extracted. For example, you could type “Get all the headings and product names.”

Step 4: Parse the Content

Click “Parse Content” to have the LLM extract and display only the information matching your description.
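
For reference, here is a minimal sketch of how main.py could wire these steps together in Streamlit, assuming the scrape_site, extract_content, clean_content, and parse_with_ollama functions shown in the next section (the chunk size is an illustrative assumption; the actual app may differ):

import streamlit as st

st.title("AI Web Scraper")

url = st.text_input("Enter the URL of the site you want to scrape")
if st.button("Scrape Website"):
    # scrape_site, extract_content, and clean_content are defined in the Code Snippets section.
    html = scrape_site(url)
    # Store the cleaned text so it survives Streamlit reruns.
    st.session_state["content"] = clean_content(extract_content(html))

if "content" in st.session_state:
    with st.expander("Raw page content"):
        st.text_area("Content", st.session_state["content"], height=300)

    description = st.text_area("Describe the data you want extracted")
    if st.button("Parse Content"):
        content = st.session_state["content"]
        # Split the text so each piece fits in the LLM's context window.
        chunks = [content[i:i + 6000] for i in range(0, len(content), 6000)]
        st.write(parse_with_ollama(chunks, description))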

Code Snippets

Here are a few core functions from the project:

Web Scraping with Selenium

The “scrape_site” function loads a website and retrieves its HTML content:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

def scrape_site(website):
    # Run Chrome headless so no browser window opens.
    chrome_options = Options()
    chrome_options.add_argument("--headless")
    driver = webdriver.Chrome(options=chrome_options)
    # Load the page and grab the fully rendered HTML.
    driver.get(website)
    html = driver.page_source
    driver.quit()
    return html
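
Selenium 4.6+ ships with Selenium Manager, which downloads a matching chromedriver automatically; on older versions you need a chromedriver on your PATH. Since Chrome fully renders the page before page_source is read, JavaScript-generated content is captured as well.
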
Parsing with BeautifulSoup

After scraping, extract_content and clean_content prepare the HTML content:

from bs4 import BeautifulSoup

def extract_content(html_content):
    # Keep only the <body> of the page.
    soup = BeautifulSoup(html_content, "html.parser")
    return str(soup.body)

def clean_content(body_content):
    soup = BeautifulSoup(body_content, "html.parser")
    # Drop script and style tags, which carry no visible text.
    for tag in soup(["script", "style"]):
        tag.extract()
    # Collapse the remaining text into stripped, non-empty lines.
    return "\n".join(line.strip() for line in soup.get_text().splitlines() if line.strip())
Parsing with the LLM

Using the langchain_ollama library, parse_with_ollama feeds each chunk of content through a prompt template and collects the model's responses (the template wording below is illustrative):

from langchain_core.prompts import ChatPromptTemplate
from langchain_ollama import OllamaLLM

template = (
    "Extract only the information matching this description: {parse_description}\n\n"
    "Content:\n{dom_content}"
)

model = OllamaLLM(model="llama3")

def parse_with_ollama(dom_chunks, parse_description):
    # Pipe the prompt template into the model so every chunk is parsed with the same instructions.
    chain = ChatPromptTemplate.from_template(template) | model
    parsed_results = []
    for chunk in dom_chunks:
        response = chain.invoke({"dom_content": chunk, "parse_description": parse_description})
        parsed_results.append(response)
    return "\n".join(parsed_results)

Credits

This project was inspired by TechWithTim. Special thanks for the foundational concepts, which have been adapted and expanded to meet additional requirements.