How to Automatically Optimize your SEO Metadata with FuzzyWuzzy and OpenAI in Google Colab

Optimize page metadata with FuzzyWuzzy and OpenAI

If you’re looking to enhance your website’s SEO without spending hours manually tweaking metadata, you’re in the right place. In this tutorial, I’ll introduce you to a variation of a method that I’ve created years ago and named “Fuzzy Booster”. 

In this tutorial, we’ll automatically optimize your SEO metadata by analyzing top-performing queries and crafting better titles, headers, and descriptions. You will learn how to use Python, FuzzyWuzzy, and OpenAI in Google Colab.

About the method: What is Fuzzy Booster?

The main idea is to ensure that your page titles, headers (h1), and descriptions match the search queries that bring you the most visitors. By doing so, we can improve out site’s click-through rates and rankings on search engines.

Normally, optimizing these elements involves a lot of manual labor, but with the power of Python and some python libraries, we can automate this process. We’ll extract top-performing queries from Google Search Console, check how well they align with your current metadata using FuzzyWuzzy, and even generate new, optimized content using OpenAI. Pretty cool, right?

Tools and libraries

Google Search Console API

This API provides access to detailed performance data about your website. We’ll use it to retrieve the top-performing queries for each URL, based on the number of clicks.

BeautifulSoup

BeautifulSoup is a Python library that helps us scrape and extract HTML data from web pages. We’ll use it to pull existing SEO metadata from your site.

FuzzyWuzzy

FuzzyWuzzy is a library that performs fuzzy string matching, ideal for comparing your current SEO metadata to top search queries to identify areas for improvement.

The token_set_ratio function is particularly effective because it measures similarity between strings regardless of word order, making it useful for assessing how well your metadata matches search queries.

OpenAI

OpenAI provides us with advanced language models like GPT, which we’ll use to automatically generate new SEO metadata that’s more aligned with top queries.

Additional Resources

Check out also the following resources to see documentation on the modules we’ll be using. This can be useful for troubleshooting if you run into errors with the code at any point:

  1. Google Search Console API Documentation
  2. BeautifulSoup Documentation
  3. FuzzyWuzzy Documentation
  4. OpenAI API Documentation

Step-by-step guide on using Fuzzy Booster for SEO metadata optimization in Google Colab

Prerequisites

Before you start, you’ll need:

  • A Google Cloud account with billing enabled.
  • A project in Google Cloud Console with the Search Console API enabled.
  • An OAuth 2.0 Client ID and JSON credentials file.
  • An OpenAI API Key

How to set up your Google Cloud Console project

  1. Create a project in Google Cloud Console:
AD 4nXd52Wc8 F O381ax944beTLnEPIQfQRyUYGHf533YbjYZsGCa4N8Pu24OR9Y2xT7FDUmg9dFoeidQxSFKqXpv1 5XIVZfKFaeUffIQ8X0dkuhtV6OOC1sALgHcnatv0q c vkUbSVpoUzn6tDYd0CD9Z0GU?key=suX6XBedEA9iYsumR0rtXw
AD 4nXfD eY1tiWBQnQSKeE SX4Lby6 mV9rxO4gY9Htr8MGqf3DoSFYIc5nsosvCF AHzJWyHKLGvC1BFIcAzVOACiECm8OeuqrSu0hzXyMSDsL070al8ryu77gKoxssJ 6H1wPvNmwSrVb3d1j 7t584Sty47k?key=suX6XBedEA9iYsumR0rtXw
  1. Enable the Search Console API:
    • Go to API & Services > Library.
AD 4nXfQFcLvrwRzMs6KptCq3tnfGRh9bx7MCsXrlZ3mFqS001aCqEb0 cwuobqfpHKno5T p1SiIrkbDrUNIZMJhtBFsxg MggC6LUm8AeebAd7Z3HtZjfK1lHpn nUpawseXdMeF7JVHC2uq9KmoG15G 9pK6b?key=suX6XBedEA9iYsumR0rtXw
  • Search for “Google Search Console API” and enable it.
AD 4nXf KFX5CMn70yLHy iGDK1CCshy6MoiulUv0
AD 4nXfcc0HAKmpHsqSgmFFETGU3HyEBQ8hVebvn0TFVKhKnCFMwLmWxI7S FHVZDhKJPmILaP9WPn2U0P IdFSToneDkY18eAfry
  1. Create OAuth 2.0 Credentials:
    • Navigate to API & Services > Credentials.
    • Click Create Credentials > OAuth 2.0 Client ID.
AD 4nXeUsMaz lzM5AZ6zgzqmkQGtyN Q7P5kNSyKoxKXWBAOUSxDiw9JZ9sbqCt1mu28hEhbHSfgyL8cLktwXKNSdVqMxBZOoTJBVP2sp3ovWJfWEjFGwTZxRJ5x
  • Set up the OAuth consent screen if it’s your first time and add User Scopes to Google Search Console Api > Read Only.
AD 4nXdouPbWUfCMZRb0UCNYFbUY8zvYT3Hgmazcv1kzWVjAmpFoYVTDaJoGsLJ 2DWQ lbUo28kc41CWYSNzZRC y7A2pV0GDsf9S9LYb3Lf50nZSFBfYyZlVuq50sCqEudAf4YwBlg1EBvW33Bp7dB6pDWFW7?key=suX6XBedEA9iYsumR0rtXw
AD 4nXe Gt2 Hamavf4XYStCJV0PX2BheVRmYI6XkQgLcyGJJLfWMQPqsfFuGqPVQhVsJbcutkyO91GnUv0DNLX4gIkP3oKva5OpITiLalaAZgkpe GwxseiM 5qViWoAO4VxK3hcAUEJnfXFNFctnMDYLH QIWE?key=suX6XBedEA9iYsumR0rtXw
AD 4nXcSfB8BkhwTl8XgwzdpFilYW3w45OExRBkUiSnAmqkBcRfiEfBi3iRrVYzAdZFx0kk1x09LjCOTk94tadSYMi6eaFlS5ki5pDkPulCXHCXTaqASYQ KzGqeL4P23CC6t WI5amD5zXJiRDnmyvrV 5enyTS?key=suX6XBedEA9iYsumR0rtXw
  • Create a new OAuth 2.0 Client ID for desktop applications.
AD 4nXeyefXpO N5UZh Div7rZrB1x97LAf9K4h7HxS Rp5W7ZvQJB2pEJisavB1QhAiqsb8VRdPYRRa4rRgFMrveGIIFEQptvMXH8cgmDa c4NBbsOcum 5X6iDUkQfeFfHjC OJ6wU34iiv7lymVBSb8D0Q3aZ?key=suX6XBedEA9iYsumR0rtXw
  • Download the JSON credentials file.

How to Set up your OpenAI API Key

To use OpenAI for generating optimized SEO metadata, you’ll need to create an API key. Follow these steps to set up your OpenAI account and obtain the key:

  1. Create an OpenAI account:
    • Go to the OpenAI website.
    • Sign up for a new account if you don’t have one. If you already have an account, just log in.
  2. Generate an API Key:
    • In the API dashboard, look for the “API Keys” section.
    • Click on “Create API Key” or “New Secret Key” to generate a new key.
    • Give your key a name for easy identification.
AD 4nXeq0k3TXmpXPFN ohXv7yM2Cf3RmG7vUHKO ba07 AuZke4CNlX2SQOHhTqTC6Re9 phCbc
  1. Save your API Key:
    • Once the key is generated, make sure to copy it and save it in a secure location. You will use this key to authenticate requests to the OpenAI API in your Google Colab environment.
  2. Configure billing:
    • Configure billing settings to ensure you have access to the API features.

How to use the Fuzzy Booster Google Colab for meta data optimization

Complete steps from above, then, make a copy of the Google Colab template.

Step 1: Define query parameters

In this analysis, we usually exclude branded queries—those that include the brand name or specific product names. The reason for this is that branded searches are generally influenced by existing brand awareness and marketing efforts, which might not reflect your site’s organic search performance accurately. The goal here is to identify organic, non-branded SEO keywords that potential customers use to find your site. 

However, feel free to modify this approach if you wish to include branded queries, especially if brand visibility and recognition are part of your core objectives.

The home page is often excluded from this analysis because it’s usually optimized for broad and general terms, which might not be as actionable as insights gathered from more specific internal pages. If your strategy requires insights from the home page or you believe there’s potential for further optimization, you can choose not to exclude it.

# Select the date range
end_date = datetime.date.today()
start_date = end_date – relativedelta.relativedelta(months=16)

# Select the site
site = ‘https://www.analistaseo.es/’ # https://www.example.com/ for a URL-prefix property or sc-domain:example.com for a Domain property

home = “^”+site+”$” # Regex to match the home page

branded_queries = ‘natzir|analistaseo|analista seo’ # Branded queries to exclude

# Request payload for Google Search Console API
request = {
    ‘startDate’: datetime.datetime.strftime(start_date,’%Y-%m-%d’),
    ‘endDate’: datetime.datetime.strftime(end_date,’%Y-%m-%d’),
    ‘dimensions’: [‘page’,’query’],
    ‘dimensionFilterGroups’: [
    {
      ‘filters’: [
        { # Excluding home page
          ‘dimension’: ‘page’,
          ‘operator’: ‘excludingRegex’,
          ‘expression’: home
        },
        { # Excluding branded queries
          ‘dimension’: ‘query’,
          ‘operator’: ‘excludingRegex’,
          ‘expression’: branded_queries
        }
      ]
    }
  ],
    ‘rowLimit’: 25000 # Up to 25,000 URLs
}

# Execute the request to the Search Console API
response = service.searchanalytics().query(siteUrl=site, body=request).execute()
print(“Getting Google Search Console…”)

# Parse the JSON response
scDict = defaultdict(list)

for row in response[‘rows’]:
    scDict[‘page’].append(row[‘keys’][0] or 0)
    scDict[‘query’].append(row[‘keys’][1] or 0)
    scDict[‘clicks’].append(row[‘clicks’] or 0)
    scDict[‘ctr’].append(row[‘ctr’] or 0)
    scDict[‘impressions’].append(row[‘impressions’] or 0)
    scDict[‘position’].append(row[‘position’] or 0)

# Create a DataFrame from the parsed data
df = pd.DataFrame(data=scDict)

Step 2: Upload your credentials file

Once you have set up the data for the analysis, Colab will ask you to upload the JSON credentials file you downloaded before.

from google.colab import files

# Upload the credentials file
uploaded = files.upload()

# Save the credentials file
with open(‘credentials.json’, ‘wb’) as f:
    f.write(uploaded[list(uploaded.keys())[0]])

Step 3: Scrape metadata with BeautifulSoup

Using BeautifulSoup, we can pull existing SEO metadata from your site, like title, h1  and descriptions.

# Function to extract metadata from a URL
def get_meta(url):
    try:
        response = requests.get(url)
        encoding = chardet.detect(response.content)[‘encoding’]
        if encoding:
            page_content = response.content.decode(encoding)
        else:
            page_content = response.content
        soup = BeautifulSoup(page_content, ‘html.parser’)
        title = soup.find(‘title’).get_text() if soup.find(‘title’) else ‘No title’ # Get the title
        meta = soup.select(‘meta[name=”description”]’)[0].attrs[“content”] if soup.select(‘meta[name=”description”]’) else ‘No meta description’ # Get the meta description
        h1 = soup.find(‘h1’).get_text() if soup.find(‘h1’) else ‘No h1’ # Get the first h1
        return title, meta, h1
    except Exception as e:
        return ‘Error’, ‘Error’, ‘Error’

# Apply the function and add the results to the DataFrame
df[‘title’], df[‘meta’], df[‘h1’] = zip(*df[‘page’].apply(get_meta))
df

It’s very important to note whether you can scrape a website beforehand by checking the terms of use. Some websites (although public) do not allow scraping and will ban you if you scrape their content or metadata.

If you find yourself banned by a website during scraping, you can use the curl_cffi library to change your user agent. This helps you rotate user agents and avoid detection, allowing you to continue accessing the site. Check this Python library here.

Step 4: Clean and prepare text data

In this step, we focus on cleaning the text data extracted from your website to make it easier to analyze and compare with search queries.

lang = ‘spanish’  # Language for stopwords

We set the language to Spanish, which helps us identify common words (stopwords) that don’t add much meaning to our analysis, such as “and,” “the,” and “is.” Change this to the language of your website.

# Function to clean text
def clean_text(text):
    text = str(text).lower().strip()
    text = re.sub(r’\d+’, ”, text)
    text = re.sub(r'[^\w\s]’, ”, text)
    text = re.sub(r’\s+’, ‘ ‘, text)
    text = unicodedata.normalize(‘NFKD’, text).encode(‘ascii’, errors=’ignore’).decode(‘utf-8’)

    stop_words = set(stopwords.words(lang))
    words = text.split()
    words_cleaned = [word for word in words if word.lower() not in stop_words]
    return ‘ ‘.join(words_cleaned)

# Clean specified columns
columns_to_clean = [‘title’, ‘meta’, ‘h1’, ‘query’]

for column in columns_to_clean:
    df[column + ‘_clean’] = df[column].apply(clean_text)

This function performs several operations to clean the text:

  • Lowercasing: Converts all characters to lowercase for consistency.
  • Number removal: Strips out numerical digits.
  • Punctuation removal: Eliminates punctuation marks.
  • Whitespace normalization: Ensures consistent spacing between words.
  • Accent removal: Converts accented characters to their non-accented versions.
  • Stopwords removal: The function filters out these stopwords from the text, keeping only the meaningful words.

Step 5: Compute similarity using Fuzzy matching

In this step, we use the fuzzywuzzy library to measure how closely the cleaned SEO metadata (titles, meta descriptions, and headers) matches the top-performing search queries. This helps us identify areas where the content might not be optimized for relevant search terms.

We use token_set_ratio from the fuzzywuzzy library, which compares strings based on their content, ignoring the order and repeated words. This method is ideal for analyzing how well the cleaned text matches the search queries, as it provides a robust similarity score.

columns = [‘title_clean’, ‘meta_clean’, ‘h1_clean’]

for col in columns:
    similarity = []
    for index, row in df.iterrows():
        sim = fuzz.token_set_ratio(row[‘query_clean’], row[col])
        similarity.append(sim)
    df[f'{col}_similarity’] = similarity

# Rename columns for clarity
df.rename(columns=lambda x: x.replace(‘_clean_similarity’, ‘_similarity’) if x.endswith(‘_clean_similarity’) else x, inplace=True)
columns_to_drop = [col for col in df.columns if ‘_clean’ in col]
df.drop(columns=columns_to_drop, inplace=True)

We loop through each row of the DataFrame, comparing the cleaned search query (query_clean) with each cleaned SEO element (title, meta, h1).

For each comparison, we calculate a similarity score using token_set_ratio. The score ranges from 0 to 100, where 100 indicates an exact match.

The similarity scores are stored in new columns (title_similarity, meta_similarity, h1_similarity) for each corresponding SEO element.

AD 4nXehcGYQs7CwvVRfidRJU8eA NiAHQcL3b1TjtmIRlM7BKLvZ7zbWg 7nJuukp5Eu5u8MmyAz8ne08Meo8QZzKObKJEybxKco6g0bG1 l2sdhfTEP2J126CON6Epruk5r2EKzJZJE8MM K86DB6f9Wn6PDiR?key=suX6XBedEA9iYsumR0rtXw

Step 6: Generate new titles using OpenAI (Optional)

In this optional step, we aim to enhance our website’s relevance by generating new titles for pages where the existing titles don’t align well with top-performing queries. This involves using the OpenAI API to automatically create more relevant and optimized SEO content.

filtered_df = df[df[‘title_similarity’] <= 60]

We filter the DataFrame to identify pages where the similarity score between the current title and the top-performing query is 60 or below. These low scores indicate a poor match and an opportunity for improvement.

You can adjust the similarity threshold based on your observations and needs. If the results suggest that a different threshold is more effective, feel free to modify it.

# Configure OpenAI API
client = OpenAI(api_key=”YOUR OPENAI TOKEN HERE”)

MODEL = “gpt-4o”
TEMPERATURES = [0.0]

Use your OpenAI API key to set up the client and specify the model and temperature for generating content. The temperature controls the randomness of the output, with lower values producing more deterministic results.

PROMPT_TEMPLATE = “””Task: Generate a new title that meets the following criteria:
1. The title cannot be longer than 60 characters.
2. The title must include the specified query, represented in the template as ‘{query}’.
3. The title must be written in the specified language, represented in the template as ‘{lang}’.
4. No words should be repeated in the title.
5. Use the entities found in the given ‘current title’ ‘{title}’ to construct the new title.
Please ensure that your title meets all of the criteria.
“””

This template guides the AI to generate a new title that includes the top query, is concise, and maintains the original title’s context. You can modify this prompt if needed to suit your specific requirements or preferences.

Using the OpenAI API may incur costs, especially with large datasets, so consider your budget and usage limits.

# Function to generate a new title using OpenAI
def generar_titulo(prompt, temperature):
    completion = client.chat.completions.create(
        model=MODEL,
        temperature=temperature,
        messages=[{“role”: “user”, “content”: prompt}],
    )
    return completion.choices[0].message.content.strip()

# Generate new titles for each filtered row
for index, row in filtered_df.iterrows():
    query = row[‘query’]
    current_title = row[‘title’]
    prompt = PROMPT_TEMPLATE.format(query=query, lang=lang, title=current_title)
    new_title = generar_titulo(prompt, TEMPERATURES[0])
    df.at[index, ‘new_title’] = new_title

# Function to strip quotes from the beginning and end of a string
def strip_quotes(title):
    if title.startswith(‘”‘) and title.endswith(‘”‘):
        return title[1:-1]
    return title

# Apply the function to the ‘new_title’ column
df[‘new_title’] = df[‘new_title’].apply(strip_quotes)

# Save the results to a CSV file
df.to_csv(‘fuzzy_booster.csv’, index=False)

# Download the CSV file
files.download(‘fuzzy_booster.csv’)
AD 4nXdFyZjgv7Fded2wcGpVzkMFft7w5sE1ESScZ7xweP B9tkaS3rZ5iR9cYtuzZx29acFHsH6VH5B9D3DCPEFylYbdIpaoyaYPkHftu HN9lPvsk8 2m5JyOPnoMsOydmKc9N 9IHTz4ONCPvkG05F7D4cI ?key=suX6XBedEA9iYsumR0rtXw

You can apply the same process to h1 tags and meta descriptions. Simply adjust the filtering and prompts to target these elements.

If necessary, you can apply this method to any other content scraped from the page, such as subheadings or paragraph text, by adjusting the prompt and process accordingly.

Benefits of using Fuzzy Booster with FuzzyWuzzy and OpenAI in Google Colab

Here’s how it can make a significant difference in your SEO strategy:

  • Time savings and efficiency: By automating the analysis and optimization of SEO metadata, you can save hours of manual work. This allows you to focus more on strategic planning and less on repetitive tasks.
  • Data-Driven insights: Fuzzy Booster provides clear, actionable insights by identifying which SEO elements are underperforming. The use of similarity scoring helps highlight opportunities for optimization that might otherwise be overlooked.
  • Improved SEO performance: By focusing on non-branded keywords, you can boost your site’s visibility to new audiences. Optimized metadata increases click-through rates and enhances search rankings, driving more organic traffic to your site.
  • Flexibility and customization: The approach is adaptable, allowing you to adjust thresholds and prompts according to your specific needs. Whether you want to include branded queries or optimize specific page elements

Next Steps: Incorporate Fuzzy Booster into your organic strategy

Now that you’ve completed this tutorial, here are some ideas on how to incorporate the Fuzzy Booster method into your broader organic strategy:

  • Apply to other content types: Beyond just titles and meta descriptions, consider using this approach for other content on your site, like blog headings or product descriptions. The principles can also apply to external content platforms, such as optimizing YouTube video titles or organizing forum posts.
  • Continuous improvement: Use Fuzzy Booster as a regular part of your SEO audits to continually enhance your metadata. Regularly update your data collection and analysis processes to adapt to changing search patterns.

Author

  • Natzir Ruiz MLforSEO Author

    Natzir is a recognized Online Marketing Expert, renowned for his work as a consultant for the web portals of top-tier global brands. He has led SEO and CRO strategies for major multinational corporations and startups, orchestrating successful campaigns in these areas. He is distinguished by his holistic approach to SEO, where he applies his knowledge of data science, technical expertise, and consumer psychology. This methodology has enabled him to successfully uplift projects where others have not been able to. As a business school professor, he imparts his knowledge and experience to the next generation of marketing professionals. Additionally, he is a speaker at major industry events, sharing his insights into the evolving landscape of online marketing.

    View all posts

Share this post on social media:


Leave a Reply

Your email address will not be published. Required fields are marked *