
In this post, you’ll learn how to extract passages from your content based on how closely related they are to queries you input. You can use this as a simple way to begin optimizing your content for SEO and GEO.
Tools and data you will need
- Google Colab (or Python)
- Google Drive
- A list of pages you want to chunk, embed, and analyze; we’ll use a sitemap
- Your grounding queries
Setup
Installation
Create a new notebook in Google Colab. I also recommend creating a few text blocks with headings, so you can segment the code blocks and hide any unnecessary ones, like the installation block after you complete the process.
In the first code block, install the following packages:
!pip install trafilatura langchain chromadb sentence-transformers langchain-text-splitters ultimate-sitemap-parser
Mounting the drive and importing necessary packages
In the second code block, we’ll go ahead and mount your Google Drive and load in the database you created in my previous tutorial. For the sake of consistency, I’m guessing you’re doing this with the same site, so we’re keeping all of the data in the same ChromaDB instance.
Also, this code block includes a few extra package imports, because my notebook is set up in function-based segments, and I don’t want to import every single time I use a query to extract passages in another block below.
from google.colab import drive
import os
import pandas as pd
import chromadb
from chromadb.utils import embedding_functions
# 1. Mount Google Drive
drive.mount('/content/drive')
# 2. Define your local path (this folder will appear on your desktop via Google Drive sync)
DB_PATH = "/content/drive/My Drive/chroma_keywords_db"
os.makedirs(DB_PATH, exist_ok=True)
Chunking and vector embedding your sitemap
In the third code block, we’ll crawl a sitemap you provide to extract a list of URLs, then extract, chunk, and vector embed the main content of each URL. Before you run the code below, make sure to make the following changes:
- Set URL_OR_SITEMAP to your homepage, your sitemap index, or a specific sitemap.
- Set DB_PATH to the location of your ChromaDB.
A few notes:
- You can set the name of your ChromaDB collection by adjusting the value passed to name="".
- The chunking for this example exercise is set at 800 characters with a 100-character overlap, meaning each new segment of text starts 700 characters after the previous one began. You can adjust these via the chunk_size and chunk_overlap parameters below.
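To make the chunk_size/chunk_overlap behavior concrete, here’s a small pure-Python sketch of the same sliding-window idea. (This is an illustration only; the real RecursiveCharacterTextSplitter also prefers to break on paragraph and sentence boundaries rather than cutting mid-word.)

```python
def sliding_chunks(text, chunk_size=800, chunk_overlap=100):
    """Naive character-window chunker: each new chunk starts
    chunk_size - chunk_overlap characters after the previous one."""
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size]
            for i in range(0, max(len(text) - chunk_overlap, 1), step)]

sample = "x" * 2000
chunks = sliding_chunks(sample)
print(len(chunks))     # → 3 windows, starting at characters 0, 700, and 1400
print(len(chunks[0]))  # → 800
```

A 2,000-character page therefore yields three overlapping chunks rather than two or three disjoint ones; the overlap is what keeps a sentence that straddles a boundary retrievable from at least one chunk.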
from usp.tree import sitemap_tree_for_homepage
import trafilatura
import pandas as pd
from langchain_text_splitters import RecursiveCharacterTextSplitter
from chromadb.utils import embedding_functions
import chromadb
# --- SETUP ---
# You can use the homepage URL or the direct sitemap.xml URL
URL_OR_SITEMAP = "https://example.com/sitemap_index.xml"
DB_PATH = "/content/drive/My Drive/chroma_keywords_db"
client = chromadb.PersistentClient(path=DB_PATH)
ef = embedding_functions.SentenceTransformerEmbeddingFunction(model_name="all-MiniLM-L6-v2")
page_collection = client.get_or_create_collection(name="site_content", embedding_function=ef)
# 1. Fetch URLs using Ultimate Sitemap Parser
print("Fetching sitemap...")
tree = sitemap_tree_for_homepage(URL_OR_SITEMAP)
urls = [page.url for page in tree.all_pages()]
print(f"Found {len(urls)} URLs. Starting crawl...")
# 2. Text Splitter Config
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=800,
    chunk_overlap=100
)
# 3. Scrape and Ingest
for url in urls:
    try:
        downloaded = trafilatura.fetch_url(url)
        if downloaded:
            content = trafilatura.extract(downloaded)
            if content:
                chunks = text_splitter.split_text(content)
                page_collection.add(
                    documents=chunks,
                    metadatas=[{"url": url} for _ in chunks],
                    ids=[f"{url}_{i}" for i in range(len(chunks))]
                )
                print(f"✅ Indexed: {url}")
    except Exception as e:
        print(f"❌ Failed to process {url}: {e}")
print("\nAll done! Your site is now embedded in ChromaDB.")
Ad hoc passage extraction
In this final code block, you’ll input a query, which will be matched against your database of article chunks. The script selects the three closest matches and presents each with its corresponding URL, chunk of text, and distance from your query (the closer to zero, the more related it is).
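As a rough intuition for what that distance number means, here’s a toy sketch that computes a cosine-style distance between two vectors by hand: identical directions score 0, unrelated directions score higher. (This is an illustration only, not ChromaDB’s internal code, and the exact metric ChromaDB uses depends on the collection’s configuration.)

```python
import math

def cosine_distance(a, b):
    """1 - cosine similarity: 0.0 for identical directions,
    larger values for less related vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return 1 - dot / (norm_a * norm_b)

print(cosine_distance([1.0, 0.0], [1.0, 0.0]))  # → 0.0 (same direction)
print(cosine_distance([1.0, 0.0], [0.0, 1.0]))  # → 1.0 (orthogonal)
```

The embeddings your queries and chunks get are just higher-dimensional versions of these toy vectors, which is why a smaller distance means a chunk sits “closer” to your query in meaning.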
sample_query = "" # @param {"type":"string","placeholder":"Add your query fan out here."}
# Search for the exact content match
rag_results = page_collection.query(
query_texts=[sample_query],
n_results=3
)
print(f"Copilot Query: {sample_query}\n")
print("Top Grounding Evidence from your Site:")
for i in range(len(rag_results['documents'][0])):
    print(f"--- Result {i+1} ---")
    print(f"Source URL: {rag_results['metadatas'][0][i]['url']}")
    print(f"Snippet: {rag_results['documents'][0][i]}")
    print(f"Distance: {rag_results['distances'][0][i]:.4f}\n")
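If you’d like to keep a record of these matches instead of just printing them, here’s a minimal sketch of writing them to a CSV with the standard library. The rag_results dict below is a hypothetical stand-in with the same nested-list shape ChromaDB’s query() returns; in your notebook, you’d use your real rag_results, and the filename and column names are my own choices.

```python
import csv

# Hypothetical stand-in for a ChromaDB query() result (same nested-list shape);
# replace with your real rag_results in the notebook.
rag_results = {
    "metadatas": [[{"url": "https://example.com/a"}, {"url": "https://example.com/b"}]],
    "documents": [["First matching chunk...", "Second matching chunk..."]],
    "distances": [[0.21, 0.35]],
}

with open("query_results.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["url", "snippet", "distance"])
    for meta, doc, dist in zip(rag_results["metadatas"][0],
                               rag_results["documents"][0],
                               rag_results["distances"][0]):
        writer.writerow([meta["url"], doc, dist])

print("Saved", len(rag_results["documents"][0]), "results to query_results.csv")
```

In Colab, you can point the filename at your mounted Drive folder (for example, inside DB_PATH’s parent directory) so the export persists between sessions.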
If you want to delete your database and start over
Lastly, if you want to reset your embedded database, use the code below. In the name argument, set it to the name of your collection.
As a precaution, I’ve commented out the command.
# WARNING: This deletes all data in that collection; uncomment the line below to run it
# client.delete_collection(name="site_content")