Keyword extraction is primarily used in Natural Language Processing (NLP) to identify the most relevant or important words or phrases (keywords) from a document or a set of documents.
In digital marketing, and particularly in SEO tasks like keyword research or content auditing, this technique is useful for reducing noise: it distils a keyword, title, or paragraph down to the one (or few) words that best represent the text semantically.
In this guide, I’ll show you how to work with KeyBERT, and how to use it for keyword clustering in the process of keyword research.
KeyBERT is a keyword extraction technique that uses BERT (Bidirectional Encoder Representations from Transformers) to generate relevant keywords from a given text. Unlike traditional keyword extraction methods that rely on statistical or linguistic approaches, KeyBERT leverages the powerful contextual embeddings of BERT to identify words or phrases that are most relevant to the content.
In short, KeyBERT splits the text into candidate words and phrases, computes embeddings for the document and for each candidate, and then returns the candidates that are most similar to the document as a whole.
In simple terms, from every sentence, from every keyword, you have one word that is the most vital one. That’s what KeyBERT tries to identify.
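To make the ranking idea concrete, here is a minimal sketch in plain Python. It is not KeyBERT's implementation: the `toy_embed` function is a hypothetical stand-in that uses simple word counts instead of BERT embeddings, and the function names are my own. Only the overall shape (embed the document, embed each candidate, rank by cosine similarity) mirrors the process described above.

```python
import math
import re
from collections import Counter


def toy_embed(text):
    """Toy stand-in for a BERT embedding: a bag-of-words count vector.

    Assumption for illustration only; KeyBERT uses dense transformer embeddings.
    """
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[key] * b[key] for key in a.keys() & b.keys())
    norm_a = math.sqrt(sum(value * value for value in a.values()))
    norm_b = math.sqrt(sum(value * value for value in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_candidates(doc, candidates, top_n=5):
    """Score every candidate against the document and keep the top_n."""
    doc_vector = toy_embed(doc)
    scored = [(c, cosine_similarity(toy_embed(c), doc_vector)) for c in candidates]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_n]


doc = "supervised learning maps input to output; learning needs labeled data"
print(rank_candidates(doc, ["learning", "output", "banana"], top_n=2))
```

With real embeddings the scores reflect meaning rather than raw word counts, which is what lets KeyBERT pick a term that represents the text even when it is not the most frequent one.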
I highly recommend learning directly from the person who created KeyBERT, Maarten Grootendorst. His documentation and guides are great resources to get you started with KeyBERT (besides this tutorial, of course).
In the following section, we’ll go through how to work with KeyBERT with Python, using a demo Google Colab Notebook.
There really isn’t much you need to get started working with KeyBERT, but the basics apply.
The first step is installing the package from PyPI (pip install keybert). You can install more models to configure the transformer and language backends you’d like to use.
You then import the KeyBERT class and create a model instance that you can reuse in your functions.
from keybert import KeyBERT
kw_model = KeyBERT()
Once you have installed and imported KeyBERT and initialized the model, provide your input text, and KeyBERT will extract keywords based on their semantic relevance.
The model scores candidates against the original text using cosine similarity, so that the extracted keywords capture the essential themes.
#Basic usage - keyword extraction
from keybert import KeyBERT
doc = """
Supervised learning is the machine learning task of learning a function that
maps an input to an output based on example input-output pairs. It infers a
function from labeled training data consisting of a set of training examples.
In supervised learning, each example is a pair consisting of an input object
(typically a vector) and a desired output value (also called the supervisory signal).
A supervised learning algorithm analyzes the training data and produces an inferred function,
which can be used for mapping new examples. An optimal scenario will allow for the
algorithm to correctly determine the class labels for unseen instances. This requires
the learning algorithm to generalize from the training data to unseen situations in a
'reasonable' way (see inductive bias).
"""
kw_model = KeyBERT()
keywords = kw_model.extract_keywords(doc)
print(keywords)
The result is a list of (keyword, score) tuples, ordered from most to least relevant. You can either let the model generate candidate keywords on its own, as above, or supply your own list of candidates for it to score.
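Since extract_keywords returns (keyword, score) pairs for a single document, a small helper can keep only the stronger matches. The sketch below is an illustration: `filter_keywords` is a hypothetical helper of mine, the 0.3 cut-off is arbitrary, and the sample scores are made up to show the shape of the data rather than real model output.

```python
def filter_keywords(keywords, min_score=0.3):
    """Keep the (keyword, score) pairs whose score is at least min_score."""
    return [(keyword, score) for keyword, score in keywords if score >= min_score]


# Made-up values in the shape extract_keywords returns for one document
sample_keywords = [
    ("supervised", 0.66),
    ("labeled", 0.41),
    ("learning", 0.34),
    ("function", 0.21),
]

strong_keywords = filter_keywords(sample_keywords, min_score=0.3)
print([keyword for keyword, _ in strong_keywords])
```

In practice you would pass the `keywords` variable from the example above instead of `sample_keywords`, and tune the threshold by inspecting a few documents.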
N-gram specified keyword extraction using KeyBERT allows for the identification of keywords at both the unigram and bigram levels. This approach enhances the contextual understanding of the text by capturing single words (unigrams) and pairs of consecutive words (bigrams) as keywords.
To extract unigrams using KeyBERT, set keyphrase_ngram_range to (1, 1). You can also set the top_n parameter to control the number of keywords returned (the default is 5). Here’s how to implement unigram extraction:
#n-gram specified keyword extraction
kw_model.extract_keywords(doc, keyphrase_ngram_range=(1, 1), stop_words=None)
In unigram extraction, each individual word is treated as a separate keyword. This method is straightforward and allows for a quick analysis of the most semantically important terms within a document.
For bigram extraction, you can adjust the keyphrase_ngram_range parameter to (2, 2) in KeyBERT, allowing the model to focus on pairs of consecutive words. This helps capture meaningful phrases in the text. Here’s an example:
kw_model.extract_keywords(doc, keyphrase_ngram_range=(2, 2), top_n=10, stop_words=None)
Using KeyBERT for both unigram and bigram keyword extraction lets you capture the essential terms and phrases that contribute to a deeper understanding of the text. You can also specify other n-gram lengths and customise this implementation further.
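To see what the n-gram range actually controls, it can help to list the candidate phrases a given range produces. KeyBERT builds its candidates with scikit-learn's CountVectorizer; the function below is only a simplified pure-Python approximation of that step (no stop-word removal, naive tokenization), and `ngram_candidates` is a name I made up for this sketch.

```python
import re


def ngram_candidates(text, ngram_range=(1, 2)):
    """List the unique word n-grams in text for every n in ngram_range (inclusive)."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    min_n, max_n = ngram_range
    candidates = []
    for n in range(min_n, max_n + 1):
        for start in range(len(tokens) - n + 1):
            phrase = " ".join(tokens[start:start + n])
            if phrase not in candidates:
                candidates.append(phrase)
    return candidates


print(ngram_candidates("best running shoes", ngram_range=(1, 1)))
print(ngram_candidates("best running shoes", ngram_range=(2, 2)))
print(ngram_candidates("best running shoes", ngram_range=(1, 2)))
```

A range such as (1, 2) mixes single words and word pairs in one candidate pool, so the returned keywords can be of either length.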
To highlight the most important terms in a document, you can use KeyBERT’s extract_keywords function with the highlight=True parameter. This feature allows you to visually emphasize the relevant keywords directly within the text. Here’s a concise example:
#highlight keywords in the document
keywords = kw_model.extract_keywords(doc, highlight=True)
This makes it easier to identify key concepts and themes within the content at a glance, and helps you spot whether you’re stuffing keywords into your text. You can apply the same check to your competitors’ content, or use it for personal tasks like analysing email or social media text to improve readability.
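If you want a number to go with the visual check, one rough option is to measure how much of the text each extracted keyword takes up. This is a heuristic of my own for illustration, not a KeyBERT feature, and `keyword_density` is a hypothetical helper; what counts as "too dense" is a judgment call.

```python
import re


def keyword_density(text, keywords):
    """Return {keyword: share of the words in text covered by that keyword}."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    total = len(tokens)
    density = {}
    for keyword in keywords:
        keyword_tokens = re.findall(r"[a-z0-9]+", keyword.lower())
        size = len(keyword_tokens)
        if total == 0 or size == 0:
            density[keyword] = 0.0
            continue
        # Count every position where the keyword's tokens appear in sequence
        hits = sum(
            1
            for start in range(total - size + 1)
            if tokens[start:start + size] == keyword_tokens
        )
        density[keyword] = hits * size / total
    return density


text = "cheap shoes cheap shoes buy cheap shoes now"
print(keyword_density(text, ["cheap shoes", "buy"]))
```

You could feed it the keywords returned by extract_keywords and review any phrase that covers an unusually large share of the text.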
In this notebook, we’ll use the KeyBERT library to automatically extract keyphrases from a list of keywords. The keyphrases are derived from the text data using pre-trained BERT models, which makes this approach well suited to identifying meaningful keywords.
Here are the steps we’ll execute:
1. File Upload
You can upload a CSV or Excel file that contains a column named Keywords. The notebook supports both file formats, so you can upload .csv, .xls, or .xlsx files directly.
2. Keyword Extraction
The KeyBERT model will analyze the Keywords column and extract the top-scoring unigram and bigram for each keyword.
3. Get the results and download your CSV file
After processing, two new columns will be added to the DataFrame: Core (1-gram) and Core (2-gram).
The DataFrame, with the new columns, will be saved as a CSV file and automatically downloaded.
The code is also available as a Colab notebook.
Alternatively, feel free to copy and modify the code below as you see fit.
# Install the necessary libraries
!pip install keybert
#!pip install transformers  # usually unnecessary: installed as a keybert dependency
!pip install openpyxl  # For handling .xlsx files (legacy .xls files also need xlrd)

import pandas as pd
from keybert import KeyBERT
from google.colab import files

# Initialize the KeyBERT model
kw_model = KeyBERT()

# Function to load a dataframe from a file uploaded by the user
def load_dataframe():
    # User uploads either a CSV or Excel file
    print("Please upload a CSV or Excel file:")
    uploaded = files.upload()
    if not uploaded:
        print("No file was uploaded.")
        return None
    filename = list(uploaded.keys())[0]  # Get the name of the uploaded file

    # Check the file extension and load accordingly
    if filename.lower().endswith('.csv'):
        df = pd.read_csv(filename)
    elif filename.lower().endswith(('.xls', '.xlsx')):
        df = pd.read_excel(filename)
    else:
        print("Unsupported file type. Please upload a CSV or Excel file.")
        return None
    return df

# Function to apply KeyBERT on the 'Keywords' column
def apply_keybert(df):
    if 'Keywords' not in df.columns:
        print("Error: The dataframe must contain a column named 'Keywords'.")
        return None

    # Return the top-scoring keyword for the given n-gram range
    def extract_ngram(text, ngram_range):
        # Empty cells are read as NaN (a float), so skip anything that is not a non-empty string
        if not isinstance(text, str) or not text.strip():
            return ""
        keywords = kw_model.extract_keywords(text, keyphrase_ngram_range=ngram_range, stop_words='english')
        return keywords[0][0] if keywords else ""  # Return the keyword or an empty string if none found

    # Create new columns for unigrams and bigrams
    df['Core (1-gram)'] = df['Keywords'].apply(lambda x: extract_ngram(x, (1, 1)))
    df['Core (2-gram)'] = df['Keywords'].apply(lambda x: extract_ngram(x, (2, 2)))
    return df

# Main function to upload the file and apply the transformations
def main():
    df = load_dataframe()
    if df is not None:
        # Apply KeyBERT to extract keywords
        df_with_keybert = apply_keybert(df)
        if df_with_keybert is not None:
            # Show the modified dataframe
            print(df_with_keybert.head())
            # Save the modified dataframe to a new CSV
            df_with_keybert.to_csv('keywords_with_keybert.csv', index=False)
            print("File saved as 'keywords_with_keybert.csv'.")
            files.download('keywords_with_keybert.csv')

# Call the main function
main()
After uploading your file, the notebook will automatically apply KeyBERT to extract keywords and download the processed file.
Provided that your input file contains a Keywords column, your downloaded file will contain the same data with two columns added: Core (1-gram) and Core (2-gram).
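If you would rather run the same idea outside Colab, here is a minimal sketch that uses only Python's csv module and takes the model as an argument. The function names `top_keyword` and `add_core_columns` and the file names in the comments are my own assumptions; the column names match the notebook above.

```python
import csv


def top_keyword(kw_model, text, ngram_range):
    """Return the best-scoring keyword for text, or "" if there is none."""
    if not isinstance(text, str) or not text.strip():
        return ""
    keywords = kw_model.extract_keywords(
        text, keyphrase_ngram_range=ngram_range, stop_words="english"
    )
    return keywords[0][0] if keywords else ""


def add_core_columns(kw_model, in_file, out_file):
    """Copy a CSV with a 'Keywords' column, adding the two core-term columns."""
    reader = csv.DictReader(in_file)
    if reader.fieldnames is None or "Keywords" not in reader.fieldnames:
        raise ValueError("The input must contain a column named 'Keywords'.")
    fieldnames = list(reader.fieldnames) + ["Core (1-gram)", "Core (2-gram)"]
    writer = csv.DictWriter(out_file, fieldnames=fieldnames, extrasaction="ignore")
    writer.writeheader()
    for row in reader:
        row["Core (1-gram)"] = top_keyword(kw_model, row["Keywords"], (1, 1))
        row["Core (2-gram)"] = top_keyword(kw_model, row["Keywords"], (2, 2))
        writer.writerow(row)


# Usage with the real model (assumes keybert is installed and keywords.csv exists):
#   from keybert import KeyBERT
#   with open("keywords.csv", newline="") as src, \
#           open("keywords_with_keybert.csv", "w", newline="") as dst:
#       add_core_columns(KeyBERT(), src, dst)
```

Passing the model in, rather than creating it inside the function, means it is loaded only once and makes the function easy to try with a stand-in object before downloading any model weights.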
You can then visualise this data, or use it for keyword or page tagging purposes.
Ideally, you would incorporate these keyword labels into your keyword research or semantic keyword universe, so that your visualisations can reflect other metrics as well (not only the count of keywords per cluster).
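As a sketch of that idea, the extracted core term can serve as a cluster label, with any numeric metric summed per cluster. The "Search Volume" column and its values are hypothetical, as is the `summarise_clusters` helper; swap in whatever metrics your keyword research file actually contains.

```python
from collections import defaultdict


def summarise_clusters(rows, label_key="Core (1-gram)", metric_key="Search Volume"):
    """Group rows by their core term and total a numeric metric per group."""
    totals = defaultdict(lambda: {"keywords": 0, "total_metric": 0})
    for row in rows:
        label = row.get(label_key) or "(unlabelled)"
        totals[label]["keywords"] += 1
        totals[label]["total_metric"] += row.get(metric_key) or 0
    return dict(totals)


# Hypothetical rows: the notebook's output joined with a made-up metric column
rows = [
    {"Keywords": "running shoes", "Core (1-gram)": "shoes", "Search Volume": 1000},
    {"Keywords": "trail shoes", "Core (1-gram)": "shoes", "Search Volume": 500},
    {"Keywords": "rain jacket", "Core (1-gram)": "jacket", "Search Volume": 200},
]

for label, stats in summarise_clusters(rows).items():
    print(label, stats)
```

The resulting per-cluster totals can then feed a chart or pivot table alongside the plain keyword counts.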
In this KeyBERT tutorial, we explored the powerful capabilities of KeyBERT for keyword extraction, text analysis, and keyword clustering. Here are the key takeaways:
The extract_keywords function with the highlight=True parameter enables users to visually emphasize important terms within their documents, enhancing readability and focus.
By analyzing keyword performance, competition, and contextual relevance, you can develop more effective content optimization strategies that align with user intent and market demands.
Overall, KeyBERT serves as a robust tool for keyword extraction and analysis, empowering marketers, content creators, and SEO professionals to enhance their strategies through data-driven insights.
Lazarina Stoy.