How to use generative AI with structured data for programmatic SEO

Elias Dabbas - Using generative AI with structured data for programmatic SEO

Prompt engineering is one of the hot topics these days, and it is constantly changing due to rapid changes in the guidelines, the underlying technologies’ capabilities, as well as our understanding of them.

It is mostly known as making many attempts, changing, rewriting a prompt, until you find one that works for the task at hand.

I would like to build on this and share a pattern of building prompt templates that can be used in bulk, for the creation of product descriptions, email marketing messages, and any other type of content.

The key enabler in this pattern is the availability of structured data for many items, and for a consistent set of attributes.

N.B. Structured data in this case is referred to data that exists in a database (regardless of platform), and is structured in an Entity – Attribute – Value (EAV) model. In the example below, the entity is the product, the attributes are the size, color and price, and the variables are the numbers in the cell.

import pandas as pd
products = pd.DataFrame({
    'product': ['Product A', 'Product B', 'Product C'],
    'size': [10, 12, 15],
    'color': ['blue', 'green', 'red'],
    'price': [150, 180, 125]
})
products
productsizecolorprice
Product A10blue150
Product B12green180
Product C15red125

A very simple prompt for a product description for one of those products can be something like this:

Please create a product description for Product A Mention its size, which is 10 Also mention its color, which is blue And talk about its price: 150 The tone of voice should be professional

Now we have several important things to consider:

  • Scalability – We would like to scale our prompting to encompass all our products.
  • Consistency – The consistency of our data enables us to run the same process across many products.
  • Reproducibility – It would be great if we can save the process that generated our product descriptions. This is in order to debug issues, reproduce them again, and eventually update them by tweaking the prompt or by adding new product attributes whenever we have updated data.

How to create prompts in bulk (a sample approach)

To continue the example, for our simple product dataset, we can create a simple for-loop that goes through the products and dynamically inserts the product’s attributes to the prompt:

prompt_template = """
Please create a product description for {product}.
Mention its size, which is {size}
Also mention its color, which is {color}
Talk about its price: {price}
The tone of voice should be professional
"""

for row in products.to_dict('records'):
    print(prompt_template.format_map(row))

Please create a product description for Product A.
Mention its size, which is 10
Also mention its color, which is blue
Talk about its price: 150
The tone of voice should be professional


Please create a product description for Product B.
Mention its size, which is 12
Also mention its color, which is green
Talk about its price: 180
The tone of voice should be professional


Please create a product description for Product C.
Mention its size, which is 15
Also mention its color, which is red
Talk about its price: 125
The tone of voice should be professional

Let’s take a quick look at some of the advantages of this approach:

ReproducibilityEventually we will most likely want to update those descriptions. Obviously this is not a good prompt, doesn’t provide much context, and can be improved in many ways. We might have a new SEO strategy and want to focus on other aspects of our products. We might get new data, like user ratings for example, and might want to incorporate a new line in our prompt, and so on.
ScalabilityThis approach clearly allows us to create descriptions on a large scale, provided we have consistent data (n.b. think attributes and variables for each entity), and that the items are similar (belonging to the same product category, or are of the same type).
DebuggingIf we see something wrong in our product descriptions, or some unwanted behavior, we can always go back and update our prompt template and clarify in a better way what we want. This enables us to start with any prompt, test for a few descriptions and keep iterating until we are satisfied. Then, we can run the prompt template across five thousand descriptions.
Minimizing hallucinationsTrusting LLMs to create product descriptions for you is risky business, because they are well-known for how easily they can invent “facts”, and also being “confident” about them. With this approach, we are restricting the factual information to the data that we provide, and we are also utilizing the great language processing capabilities of LLMs. We can also explicitly mention in the prompt something like, “only use the data provided.”

Now that we have clarified the approach and its benefits, let’s see how this might be used in a real-life situation.

Project Overview

The example I will go through is for NBAstats.pro a website for NBA player statistics as the name suggests. It’s an experimental project for testing this approach and seeing how useful it can be for users, and how it might rank on search.

We will connect to the NBA data API, obtain career data for each player (per season and per metric), and use that to create profile pages for each player.

The main keyword template we are targeting here is “<player_name> stats”.

We will then build a prompt template that populates the player data dynamically into different prompts, and pass these on to an LLM – in this case, Open AI’s GPT.

The output will contain the prompt generated, the stats of the player that were passed on in the prompt, and the generated description that the LLM created for each of our players.

The Why – added value of using LLMs for product descriptions

Despite showing the process with a dummy dataset of NBA players, this approach is suitable for any programmatic SEO project with a well-structured database, and can be used for creating entire pages, if there is a robust template for each page. For e-commerce website, this approach can be a great way to scale and speed-up the writing process of product descriptions .

What we are doing here is not a simple re-arrangement of what someone else wrote and calling it our own. This would not add real value, and would not be an ethical thing to do. We are using publicly available data to uncover some insights, and with those facts creating an interesting description. This description should make the reader better informed about the topic discussed in two main ways:

Insights pulled to the surface from the raw data

With the availability of structured data about our players, we can make some calculations, and uncover some interesting facts that might not be immediately visible from the raw data.

For example, the raw data show how many points they scored every season, as well as how games they played. We can easily calculate the points per game and provide more insight than what the simple data might suggest. We can take this further by taking those totals or averages, and comparing them to all other players in the NBA, “Player A scored a total of X points in his career, putting him in the top Z players of all time.”

Charts and visualisations to aid data comprehension

Another added value we can add to the raw data is that we can visualize the data for truly more insightful perspectives into the players’ careers. We are targeting who are specifically searching for player stats, and they would most likely appreciate more insights into those stats through some visualizations.

How easy is it to understand Michael Jordan’s career stats by looking at this table?

image

Maybe this is better:

image 1

Was it easy for you to spot the two seasons where he played a very low number of games, and that otherwise he played pretty much all games of all seasons?

Was it clear from the table, that after his third year, his total points per season kept going down throughout his career (even though the games he played was stable, with two exceptions)?

Of course those charts are not produced by the LLM. We will produce them with Plotly, a Python data visualisation library, and of course we can use other libraries for that.

From an SEO perspective, we are also producing a lot of image assets which might hopefully also rank on image search for related terms. In this website, we have 8 charts per player, which is almost forty thousand charts in total. This is real content based on actual data, and we are using it to provide additional insights to our users.

How to use LLMs to create descriptions programmatically: Step-by-Step Guide

Just like what we did with the toy example above, we will now create a prompt template for NBA players.

Import libraries

import pandas as pd
from openai import OpenAI
from nba_api.stats.endpoints import playercareerstats
from nba_api.stats.static import players, teams
import plotly.express as px
pd.options.display.max_columns = None
import os
from IPython.display import display_markdown
def md(text):
    return display_markdown(text, raw=True)
client = OpenAI(api_key=os.environ['OPENAI_API_KEY'])

Get static data about players and teams (ID’s, full name, city, state, etc.)

These are two fairly stable datasets containing simple high-level data on players and teams.

players_static = players.get_players()
pd.DataFrame(players_static).to_csv('players_static.csv', index=False)
teams_static = teams.get_teams()
pd.DataFrame(teams_static).to_csv('teams_static.csv', index=False)
players_static = pd.read_csv('players_static.csv')
teams_static = pd.read_csv('teams_static.csv')
players_static.sample(10)
idfull_namefirst_namelast_nameis_active
339377825George PearcyGeorgePearcyFalse
18491538Cedric HendersonCedricHendersonFalse
407722Rik SmitsRikSmitsFalse
282977517Paul McCrackenPaulMcCrackenFalse
3022600006Earl MonroeEarlMonroeFalse
432478340Mel ThurstonMelThurstonFalse
34602565Zoran PlaninicZoranPlaninicFalse
1782203914Gary HarrisGaryHarrisTrue
42203128Furkan AldemirFurkanAldemirFalse
109476577Terry DischingerTerryDischingerFalse
teams_static.sample(10)
idfull_nameabbreviationnicknamecitystateyear_founded
131610612750Minnesota TimberwolvesMINTimberwolvesMinnesotaMinnesota1989
151610612752New York KnicksNYKKnicksNew YorkNew York1946
271610612764Washington WizardsWASWizardsWashingtonDistrict of Columbia1961
211610612758Sacramento KingsSACKingsSacramentoCalifornia1948
61610612743Denver NuggetsDENNuggetsDenverColorado1976
91610612746Los Angeles ClippersLACClippersLos AngelesCalifornia1970
191610612756Phoenix SunsPHXSunsPhoenixArizona1968
141610612751Brooklyn NetsBKNNetsBrooklynNew York1976
31610612740New Orleans PelicansNOPPelicansNew OrleansLouisiana2002
11610612738Boston CelticsBOSCelticsBostonMassachusetts1946
players_static[players_static['full_name'].isin(['LeBron James', 'Michael Jordan'])]
idfull_namefirst_namelast_nameis_active
21112544LeBron JamesLeBronJamesTrue
2291893Michael JordanMichaelJordanFalse

Get player career stats using the player’s ID

The playercareerstats endpoint is responsible for providing all the data for each player. For this we need the player’s ID which we can get from the previously obtained players_static table.

lebron_james = playercareerstats.PlayerCareerStats(player_id=2544)
michael_jordan = playercareerstats.PlayerCareerStats(player_id=893)
lebron_james.get_data_frames()[0].head()
PLAYER_IDSEASON_IDLEAGUE_IDTEAM_IDTEAM_ABBREVIATIONPLAYER_AGEGPGSMINFGMFGAFG_PCTFG3MFG3AFG3_PCTFTMFTAFT_PCTOREBDREBREBASTSTLBLKTOVPFPTS
025442003-04001610612739CLE19.079793120.062214920.417632170.2903474600.75499333432465130582731491654
125442004-05001610612739CLE20.080803388.079516840.4721083080.3514776360.750111477588577177522621462175
225442005-06001610612739CLE21.079793361.087518230.4801273790.3356018140.73875481556521123662601812478
325442006-07001610612739CLE22.078783190.077216210.476993100.3194897010.69883443526470125552501712132
425442007-08001610612739CLE23.075743027.079416420.4841133590.3155497710.712133459592539138812551652250
michael_jordan.get_data_frames()[0].head()
PLAYER_IDSEASON_IDLEAGUE_IDTEAM_IDTEAM_ABBREVIATIONPLAYER_AGEGPGSMINFGMFGAFG_PCTFG3MFG3AFG3_PCTFTMFTAFT_PCTOREBDREBREBASTSTLBLKTOVPFPTS
08931984-85001610612741CHI22.082823144.083716250.5159520.1736307460.845167367534481196692912852313
18931985-86001610612741CHI23.0187451.01503280.4573180.1671051250.8402341645337214546408
28931986-87001610612741CHI24.082823281.0109822790.48212660.1828339720.8571662644303772361252722373041
38931987-88001610612741CHI25.082823311.0106919980.5357530.1327238600.8411393104494852591312522702868
48931988-89001610612741CHI26.081813255.096617950.53827980.2766747930.850149503652650234652902472633

Create a prompt template and dynamically insert each player’s data

Just like we did with the simple prompt template above, we can do the exact same thing we did above. Note that since we have more data and of different types, the loop is not as straightforward.

For example, for most columns we get the sum of the column to get the player’s career stats for that metric (games played, steals, blocks, etc.). In some cases, like seasons played, and teams played for, it doesn’t make sense to sum these values.

responses = []
for player_id in [2544, 893, 203999, 76003]:
    player_stats_df = playercareerstats.PlayerCareerStats(player_id=player_id).get_data_frames()[0]
    player_stats_df = pd.merge(
        player_stats_df,
        players_static,
        left_on='PLAYER_ID',
        right_on='id',
        how='left')
    player_stats_df = pd.merge(
        player_stats_df,
        teams_static,
        left_on='TEAM_ID',
        right_on='id',
        how='left')
    display(player_stats_df)

    # player_id = int(player_id.replace('player_', '').replace('.csv', ''))
    # t0 = time.time()
    try:
        df = player_stats_df[player_stats_df['PLAYER_ID'].eq(player_id)]
        if (df['FGM'].sum() > 0) and  (df['FGA'].sum() > 0):
            goals_percent = df['FGM'].sum() / df['FGA'].sum()
        else:
            goals_percent = 'unknown'
        if (df['FTM'].sum() > 0) and (df['FTA'].sum() > 0):
            freethrows_percent = df['FTM'].sum() / df['FTA'].sum() > 0
        else:
            freethrows_percent = 'unknown'
        d = dict(
            name = df['full_name_x'].iloc[0],
            teams = df['full_name_y'].drop_duplicates().dropna().tolist(),
            cities = df['city'].drop_duplicates().dropna().tolist(),
            states = df['state'].drop_duplicates().dropna().tolist(),
            active = df['is_active'].iloc[-1],
            start_season = df['SEASON_ID'].iloc[0],
            end_season = df['SEASON_ID'].iloc[-1],
            start_age = df['PLAYER_AGE'].iloc[0],
            end_age = df['PLAYER_AGE'].iloc[-1],
            games_played = df['GP'].sum(),
            minutes_played = df['MIN'].sum(),
            goals_attempted = df['FGA'].sum(),
            goals_made = df['FGM'].sum(),
            goals_percent = goals_percent,
            freethrows_attempted = df['FTA'].sum(),
            freethrows_made = df['FTM'].sum(),
            freethrows_percent = freethrows_percent,
            rebounds = df['REB'].sum(),
            rebounds_def = df['DREB'].sum(),
            rebounds_off = df['OREB'].sum(),
            assists = df['AST'].sum(),
            steals = df['STL'].sum(),
            blocks = df['BLK'].sum(),
            points = df['PTS'].sum(),
    )
        prompt = [
            {"role": "system",
             "content": """
You are a smart, detail-oriencted, keen NBA Basketball player analyst.
Please write an introductory text for a profile page of this Basketball player.
Length: 500 - 800 words.
please stick to the stats provided.
Tone: should be interesting factual intriguing and inviting the user to dive
into the charts on the page to better get to know the player."""},
             {"role": "user",
              "content": f"""
Please write and article for the player using the following details:
{d}"""}]
        display(md(f"## {df['full_name_x'].iloc[0]}"))
        for k, v in prompt[0].items():
            print(k, v)
        print()
        for k, v in d.items():
            print(k, v)
    except Exception as e:
        print(str(e))
        continue
    completion = client.chat.completions.create(
      # model="gpt-4o",
      model="gpt-3.5-turbo",
      messages=[
        {"role": "system",
         "content": """You are a smart, detail-oriencted, keen NBA Basketball player analyst. \
         Please write an introductory text for a profile page of this Basketball player.
         Length: 500 - 800 words.
         please stick to the stats provided.
         Tone: should be interesting factual intriguing and inviting the user to dive \
         into the charts on the page to better get to know the player.
         """},
        {"role": "user",
         "content": f"""Please write and article for the player using the following details:
         {d}
         """}
      ]
    )
    responses.append([player_id, completion])
    print()
    print(completion.dict()['choices'][0]['message']['content'])
    print()
    display(md('----'))

The system response would look like this – first summarising the generated prompt, then the player stats, then followed by the generated text. The below is an example for the player Nikola Jokic.

role system
content 
You are a smart, detail-oriencted, keen NBA Basketball player analyst.
Please write an introductory text for a profile page of this Basketball player.
Length: 500 - 800 words.
please stick to the stats provided.
Tone: should be interesting factual intriguing and inviting the user to dive
into the charts on the page to better get to know the player.

name Nikola Jokic
teams ['Denver Nuggets']
cities ['Denver']
states ['Colorado']
active True
start_season 2015-16
end_season 2023-24
start_age 21.0
end_age 29.0
games_played 675
minutes_played 21077.0
goals_attempted 9778
goals_made 5450
goals_percent 0.5573736960523624
freethrows_attempted 3099
freethrows_made 2563
freethrows_percent True
rebounds 7249
rebounds_def 5466
rebounds_off 1783
assists 4667
steals 822
blocks 491
points 14139

Nikola Jokic: The Game-Changing Force from the Mile High City

When it comes to the realm of basketball, few players possess the unique blend of skills and basketball IQ that Nikola Jokic brings to the court. As a key player for the Denver Nuggets since the 2015-2016 season, Jokic has not only solidified his place as one of the league's top players but has also become a fan favorite in the city of Denver, Colorado.

Standing at 7 feet tall, Jokic cuts an imposing figure on the court, but it's his finesse and versatility that truly set him apart from his peers. With a stellar field goal percentage of 55.7%, Jokic has proven time and time again that he has the scoring touch to make an impact in any game situation. His ability to stretch the floor as a center, with a reliable mid-range shot and three-point range, makes him a nightmare matchup for opposing defenders.

But Jokic's impact goes beyond just scoring. His passing ability and court vision are arguably what truly elevate his game to the next level. With an impressive 4667 assists over his career, Jokic is not just a scorer - he is a playmaker, constantly creating opportunities for his teammates and keeping the Nuggets' offense flowing smoothly.

On the defensive end, Jokic is equally impactful. With 7249 total rebounds, including an impressive 1783 offensive rebounds, Jokic is a force on the boards, giving his team crucial second-chance opportunities and limiting opponents' offensive possessions. His 822 steals and 491 blocks also showcase his ability to disrupt plays and protect the rim when called upon.

As a truly all-around player, Jokic's impact on the court cannot be overstated. His versatility, basketball IQ, and unique skill set make him a player that opposing teams must always account for and game-plan against. It's no wonder that he has become the face of the Denver Nuggets and a player that fans in the Mile High City have come to adore.

Through 675 games played and over 21,000 minutes on the court, Jokic has consistently proven himself as a reliable and game-changing presence for the Nuggets. His numbers speak for themselves - with 14139 points scored and counting, Jokic continues to be a dominant offensive force for his team.

At just 29 years old, Jokic is still in the prime of his career, with plenty of basketball ahead of him. As he continues to evolve and refine his game, there's no telling how far he can take the Nuggets and what accolades he can achieve in the years to come.

For fans of the game looking to witness a player who truly embodies the essence of modern basketball - a player who can score, pass, rebound, and defend with equal prowess - look no further than Nikola Jokic. Dive into the stats, dissect his game, and marvel at the skill and talent of this basketball virtuoso from Denver.

Review the generated texts

Now let’s review the generated text for some of the key players:

LeBron James

Screenshot 2024 10 08 at 22.31.16

Michael Jordan

Screenshot 2024 10 08 at 22.33.13

Summary

This was a quick overview of how we can use LLMs to create content in bulk. The sole requirements being having clean, correct, and consistent data, in a structured format.

This enables us to scale the content creation process to make articles or product descriptions.

The fact that we have the full process of creating the descriptions empowers us even more because we can always recreate the content if and when the data changes, as will definitely happen with NBA stats, or if we want to change the tone of voice, focus of the article, or change any aspect of it.

Author

  • Ellias Dabbas MLforSEO Contributor

    Elias is a digital marketing and Data Science practitioner. He is the creator and maintainer of the Python library advertools, and is the author of the book Interactive Dashboards and Data Apps with Plotly and Dash. He was motivated to start learning and implementing ML in Organic Search by the possibilities of discovering trends, insights, as well as improving his overall productivity. Some of his most successful MLforSEO projects include: fine-tuning a model that extracts entities from articles and gets their Wikipedia URLs, which was later also implemented as an interactive app. Some notable work Elias has done in ML/SEO includes nbastats.pro project and the adver.tools suite of ML technologies.

    View all posts

Share this post on social media:


One response to “How to use generative AI with structured data for programmatic SEO”
  1. Suresh kumar Gondi Avatar

    Elias, it’s a great way of using the templates of prompt for each player and this opens up a lot of ideas.

    We can add up a lot of variables to the Dataframes and we can customise the prompts too.

    Curious question, Which model gives a good output of text? Have we controlled hallucination too with this idea?

Leave a Reply

Your email address will not be published. Required fields are marked *