Ngram Analysis

Owner

Marc Pfeiffer

Tags

Python

Last edited time

Sep 29, 2023 1:42 PM

Objective: The objective is to perform an ngram analysis on either a bulk file or an aggregated historical customer search term report file. This analysis will help identify single word phrase match candidates that can be excluded from campaigns.

The excel sheet contains the following column headers:

Column A: "Row Labels" which represents the query from the customer search term report
Column B: "Sum of Impressions"
Column C: "Sum of Clicks"
Column D: "Sum of Spend"
Column E: "Sum of 7 Day Total Sales"
Column F: "Sum of 7 Day Total Units (#)"
Column G: "Sum of 7 Day Advertised SKU Sales"

These columns are saved in an excel file called "ngram analysis".

To implement this, follow the steps below:

Read the Excel file "ngram analysis" into a DataFrame.
Loop through each row and split the "Row Labels" column to obtain unique words.
For each unique word, aggregate the values of the other columns based on the presence of that word in "Row Labels".
Calculate the additional columns as mentioned.
Write the resulting DataFrame to a new Excel file named "ngram analysis - done".

import pandas as pd

# Read the Excel file into a DataFrame
df = pd.read_excel('ngram analysis.xlsx')

# Initialize an empty DataFrame to store the results
# Exclude the "Row Labels" column from the result DataFrame
result_columns = [col for col in df.columns if col != 'Row Labels']
result_df = pd.DataFrame(columns=result_columns).astype('float64')

# Loop through each row in the original DataFrame
for index, row in df.iterrows():
    # Split the "Row Labels" into unique words
    words = set(row['Row Labels'].split())

    # Loop through each unique word
    for word in words:
        # Check if the word is already in the result DataFrame
        if word in result_df.index:
            # Update the existing row with aggregated values
            for column in result_columns:
                result_df.at[word, column] += float(row[column])  # Ensure values are float
        else:
            # Add a new row to the result DataFrame
            result_df.loc[word] = [float(row[column]) if column != 'Row Labels' else 0.0 for column in result_columns]

# Calculate additional columns
result_df['ACOS'] = result_df['Sum of Spend'] / result_df['Sum of 7 Day Total Sales']
result_df['sSKU ACOS'] = result_df['Sum of Spend'] / result_df['Sum of 7 Day Advertised SKU Sales']
result_df['CPC'] = result_df['Sum of Spend'] / result_df['Sum of Clicks']
result_df['CTR'] = result_df['Sum of Clicks'] / result_df['Sum of Impressions']
result_df['CVR'] = result_df['Sum of 7 Day Total Units (#)'] / result_df['Sum of Clicks']

# Save the result DataFrame to a new Excel file
with pd.ExcelWriter('ngram analysis - done.xlsx', engine='openpyxl') as writer:
    result_df.to_excel(writer, index_label='Word', sheet_name='Sheet1')

The output will look something like this (single words hidden for privacy)

We can see the top word has spent $60 without any sales indicating that it’s a good candidate to negate as a Phrase Negative match to eliminate wasted ad spend. In looking at this data it’s also interesting to analyze the CTR - we can see that some of these bad words have a high CTR

Possible Errors that can come up:

Python is sensitive if a column has a space at the end of the text. For example if I add a space in the “Sum of 7 Day Total Sales” then it will cause a KeyError. The script is trying to access a column named 'Sum of 7 Day Total Sales' in result_df, but it seems this column doesn't exist, resulting in a KeyError. To resolve this, you need to ensure that the column name is spelled correctly and that it actually exists in result_df.