
In this post, you’ll learn how to extract passages from your content based on how closely related they are to queries you input. You can use this as a simple way to begin optimizing your content for SEO and GEO.
Tools and data you will need
- Google Colab (or Python)
- Google Drive
- A list of pages you want to chunk, embed, and analyze; we’ll use a sitemap
- Your grounding queries
Setup
Installation
Create a new notebook in Google Colab. I also recommend creating a few text blocks with headings, so you can segment the code blocks and hide any unnecessary ones, like the installation block after you complete the process.
In the first code block, install the following packages:
!pip install trafilatura langchain chromadb sentence-transformers langchain-text-splitters ultimate-sitemap-parser
Mounting the drive and importing necessary packages
In the second code block, we’ll go ahead and mount your Google Drive and load in the database you created in my previous tutorial. For the sake of consistency, I’m guessing you’re doing this with the same site, so we’re keeping all of the data in the same ChromaDB instance.
Also, this code block includes a few extra package imports, because my notebook is set up in function-based segments, and I don’t want to import every single time I use a query to extract passages in another block below.
from google.colab import drive
import os
import pandas as pd
import chromadb
from chromadb.utils import embedding_functions
# 1. Mount Google Drive
drive.mount('/content/drive')
# 2. Define your local path (this folder will appear on your desktop via Google Drive sync)
DB_PATH = "/content/drive/My Drive/chroma_keywords_db"
os.makedirs(DB_PATH, exist_ok=True)
Chunking and vector embedding your sitemap
In the third code block, we’ll crawl a sitemap you provide to extract a list of URLs, then extract, chunk, and vector embed the main content of each URL. Before you run the code below, make sure to make the following changes:
- Set URL_OR_SITEMAP to your homepage, your sitemap index, or a specific sitemap.
- Set DB_PATH to the location of your ChromaDB.
A few notes:
- You can set the name of your ChromaDB collection by adjusting the value passed to name="".
- The chunking for this example exercise is set at 800 characters with a 100-character overlap, meaning each new segment of text starts 700 characters after the previous one began. You can adjust these via the chunk_size and chunk_overlap parameters below.
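To make the chunk_size/chunk_overlap behavior concrete, here’s a small pure-Python sketch of the same sliding-window idea. (This is an illustration only; the real RecursiveCharacterTextSplitter also prefers to break on paragraph and sentence boundaries rather than cutting mid-word.)

```python
def sliding_chunks(text, chunk_size=800, chunk_overlap=100):
    """Naive character-window chunker: each new chunk starts
    chunk_size - chunk_overlap characters after the previous one."""
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size]
            for i in range(0, max(len(text) - chunk_overlap, 1), step)]

sample = "x" * 2000
chunks = sliding_chunks(sample)
print(len(chunks))     # → 3 windows, starting at characters 0, 700, and 1400
print(len(chunks[0]))  # → 800
```

A 2,000-character page therefore yields three overlapping chunks rather than two or three disjoint ones; the overlap is what keeps a sentence that straddles a boundary retrievable from at least one chunk.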
from usp.tree import sitemap_tree_for_homepage
import trafilatura
import pandas as pd
from langchain_text_splitters import RecursiveCharacterTextSplitter
from chromadb.utils import embedding_functions
import chromadb
# --- SETUP ---
# You can use the homepage URL or the direct sitemap.xml URL
URL_OR_SITEMAP = "https://example.com/sitemap_index.xml"
DB_PATH = "/content/drive/My Drive/chroma_keywords_db"
client = chromadb.PersistentClient(path=DB_PATH)
ef = embedding_functions.SentenceTransformerEmbeddingFunction(model_name="all-MiniLM-L6-v2")
page_collection = client.get_or_create_collection(name="site_content", embedding_function=ef)
# 1. Fetch URLs using Ultimate Sitemap Parser
print("Fetching sitemap...")
tree = sitemap_tree_for_homepage(URL_OR_SITEMAP)
urls = [page.url for page in tree.all_pages()]
print(f"Found {len(urls)} URLs. Starting crawl...")
# 2. Text Splitter Config
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=800,
    chunk_overlap=100
)
# 3. Scrape and Ingest
for url in urls:
    try:
        downloaded = trafilatura.fetch_url(url)
        if downloaded:
            content = trafilatura.extract(downloaded)
            if content:
                chunks = text_splitter.split_text(content)
                page_collection.add(
                    documents=chunks,
                    metadatas=[{"url": url} for _ in chunks],
                    ids=[f"{url}_{i}" for i in range(len(chunks))]
                )
                print(f"✅ Indexed: {url}")
    except Exception as e:
        print(f"❌ Failed to process {url}: {e}")
print("\nAll done! Your site is now embedded in ChromaDB.")
Ad hoc passage extraction
In this final code block, you’ll input a query, which will be matched against your database of article chunks. The script selects the three closest matches and presents each with its corresponding URL, chunk of text, and distance from your query (the closer to zero, the more related it is).
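As a rough intuition for what that distance number means, here’s a toy sketch that computes a cosine-style distance between two vectors by hand: identical directions score 0, unrelated directions score higher. (This is an illustration only, not ChromaDB’s internal code, and the exact metric ChromaDB uses depends on the collection’s configuration.)

```python
import math

def cosine_distance(a, b):
    """1 - cosine similarity: 0.0 for identical directions,
    larger values for less related vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return 1 - dot / (norm_a * norm_b)

print(cosine_distance([1.0, 0.0], [1.0, 0.0]))  # → 0.0 (same direction)
print(cosine_distance([1.0, 0.0], [0.0, 1.0]))  # → 1.0 (orthogonal)
```

The embeddings your queries and chunks get are just higher-dimensional versions of these toy vectors, which is why a smaller distance means a chunk sits “closer” to your query in meaning.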
sample_query = "" # @param {"type":"string","placeholder":"Add your query fan out here."}
# Search for the exact content match
rag_results = page_collection.query(
query_texts=[sample_query],
n_results=3
)
print(f"Copilot Query: {sample_query}\n")
print("Top Grounding Evidence from your Site:")
for i in range(len(rag_results['documents'][0])):
    print(f"--- Result {i+1} ---")
    print(f"Source URL: {rag_results['metadatas'][0][i]['url']}")
    print(f"Snippet: {rag_results['documents'][0][i]}")
    print(f"Distance: {rag_results['distances'][0][i]:.4f}\n")
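If you’d like to keep a record of these matches instead of just printing them, here’s a minimal sketch of writing them to a CSV with the standard library. The rag_results dict below is a hypothetical stand-in with the same nested-list shape ChromaDB’s query() returns; in your notebook, you’d use your real rag_results, and the filename and column names are my own choices.

```python
import csv

# Hypothetical stand-in for a ChromaDB query() result (same nested-list shape);
# replace with your real rag_results in the notebook.
rag_results = {
    "metadatas": [[{"url": "https://example.com/a"}, {"url": "https://example.com/b"}]],
    "documents": [["First matching chunk...", "Second matching chunk..."]],
    "distances": [[0.21, 0.35]],
}

with open("query_results.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["url", "snippet", "distance"])
    for meta, doc, dist in zip(rag_results["metadatas"][0],
                               rag_results["documents"][0],
                               rag_results["distances"][0]):
        writer.writerow([meta["url"], doc, dist])

print("Saved", len(rag_results["documents"][0]), "results to query_results.csv")
```

In Colab, you can point the filename at your mounted Drive folder (for example, inside DB_PATH’s parent directory) so the export persists between sessions.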
If you want to delete your database and start over
Lastly, if you want to reset your embedded database, use the code below. In the name argument, set it to the name of your collection.
As a precaution, I’ve commented out the command.
# WARNING: This deletes all data in that collection; uncomment the line below to run it
# client.delete_collection(name="site_content")