Automate your DataLayer tests with Selenium & Python

Let’s face it: nobody enjoys running DataLayer quality checks on a regular basis. Most of us prefer developing a cool new feature or diving deep into a new analysis. But to do all of this while producing value, data quality is a must (“shit in, shit out”). Thus, it is crucial to know whether all the pages in your CMS push the correct page attributes into the DataLayer. To automate this and other test cases, I’ve developed a handful of Python modules that take care of such quality assurance tasks.

Prerequisite: you need Python installed on your machine (Python 3.10 or above). The code below also relies on the third-party packages selenium, requests, xmltodict and pandas.

The Setup:

  1. Retrieve the complete Sitemap.xml file for your website and parse the contents.
  2. Open every page included in your sitemap and perform the necessary page interactions (e.g. granting consent).
  3. Retrieve the DataLayer and check for missing keys.
  4. Create a list of all pages and the missing keys on each page and transform it into a CSV file.
  5. Send the file to yourself as well as your website manager.

The following function retrieves the sitemap.xml file and returns a list of all URLs included. Optionally you can use the limit argument to cut the list’s length:

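import requests
import xmltodict

def get_sitemap(url: str, limit: int | None = None) -> list:
    """
    Returns a list of all links that exist in the sitemap.xml file
    """
    res = requests.get(url, timeout=30)
    res.raise_for_status()  # fail early on HTTP errors
    raw = xmltodict.parse(res.text)
    entries = raw["urlset"]["url"]
    # xmltodict returns a dict instead of a list if there is only one <url>
    if isinstance(entries, dict):
        entries = [entries]
    data = [r["loc"] for r in entries]
    # Optionally cut the list to the first `limit` URLs
    return data[:limit] if limit else data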
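
To try the parsing step without touching the network, here is a minimal sketch that extracts the URLs from a sitemap string. Note that it swaps xmltodict for the standard library's xml.etree.ElementTree, and that the helper name parse_sitemap_xml as well as the sample XML are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Namespace used by the sitemaps.org protocol
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def parse_sitemap_xml(xml_text, limit=None):
    """Hypothetical helper: return all <loc> values of a sitemap string."""
    root = ET.fromstring(xml_text)
    urls = [
        loc.text.strip()
        for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)
        if loc.text
    ]
    # Optionally cut the list to the first `limit` URLs
    return urls[:limit] if limit else urls

# Made-up sitemap with three pages
sample = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/about</loc></url>
  <url><loc>https://example.com/contact</loc></url>
</urlset>"""

print(parse_sitemap_xml(sample, limit=2))
```

This only covers a plain urlset. If your website uses a sitemap index that links to several sitemaps, you would have to follow those links first.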

The next thing you want to do is take this list of URLs and use selenium to open each webpage and retrieve the DataLayer. This part is a little bit more tricky and needs to take some optional actions into account.

The url argument is obvious: it’s the URL of the webpage you want to retrieve the DataLayer object from. The index tells the function which occurrence of a specific DataLayer event you want to retrieve. If you have multiple scroll events and want to check the DataLayer for the first one, the index is 0; for the third one it’s 2. The event argument is “None” by default. In this case, you get the object that sits at the position given by the index argument in the DataLayer. If you want to check, for example, page information that is populated on load, you can often leave the event as it is and set the index to 0, as this information is often the first element. The navigation_steps argument is the fun part. It takes a list of instructions that selenium executes to simulate user behavior on your webpage. You can click stuff, scroll through the page or even submit a form. The list itself can look like this:

ex_steps = [
    {
        "css-selector": "#cmpbntyestxt",
        "wait": 3,
        "action": "click"
    },
    {
        "css-selector": "div.footer",
        "wait": 3,
        "action": "scroll"
    },
    {
        "css-selector": "h1",
        "wait": 3,
        "action": "scroll"
    }
]

The above example instructs selenium to click on a button in the consent manager, scroll to the page’s footer and then scroll back to the H1. The “wait” key sets the maximum number of seconds selenium is allowed to wait for the element to appear on the page. My function only supports clicking and scrolling as actions, but feel free to add more. The last two arguments instruct selenium to wait for your CMP (consent management platform) to be loaded and visible. Just pass the CSS selector of an element within the CMP in the cmp_selector argument and that’s it.

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import time

def get_datalayer(url: str, index: int = 0, event: str = "None", navigation_steps: list = None, wait_for_cmp: bool = True, cmp_selector: str = "#cmpwelcomebtncustom"):
    """
    Opens a webpage and returns a specific DataLayer object.
    The object can be selected either by index or by event-name.
    """
    # Set up the Chrome WebDriver
    options = webdriver.ChromeOptions()
    options.add_experimental_option(
        'excludeSwitches',
        ['enable-logging']
    )
    # Pass the options to the driver, otherwise they have no effect
    driver = webdriver.Chrome(options=options)

    try:
        driver.get(url)
        time.sleep(2)
        if wait_for_cmp:
            WebDriverWait(driver, 10).until(EC.visibility_of_element_located((By.CSS_SELECTOR, cmp_selector)))

        # Optional navigation through page
        if navigation_steps is not None:
            for step in navigation_steps:
                sel = step["css-selector"]
                el = WebDriverWait(driver, step["wait"]).until(EC.visibility_of_element_located((By.CSS_SELECTOR, sel)))
                if step["action"] == "click":
                    el.click()
                    print("clicked")
                elif step["action"] == "scroll":
                    # Reading this property scrolls the element into view
                    el.location_once_scrolled_into_view
                    print("scrolled")

        # Wait until the DataLayer exists and has at least one entry
        WebDriverWait(driver, 30).until(
            lambda driver: driver.execute_script("return typeof dataLayer !== 'undefined' && dataLayer.length > 0")
        )
        if event == "None":
            # No event name given: return the entry at position `index`
            return driver.execute_script(f"return window.dataLayer[{index}];")
        # Event name given: return the `index`-th entry with that event name
        dataLayer = driver.execute_script("return window.dataLayer;")
        occ = [ob for ob in dataLayer if isinstance(ob, dict) and ob.get("event") == event]
        return occ[index]
    finally:
        # Always close the browser, even if a wait times out
        driver.quit()
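
The interplay of index and event is easier to see without a browser. The following sketch mirrors the selection rule described above on a plain Python list. The helper name select_entry and the sample DataLayer are made up for illustration and are not part of the original modules.

```python
def select_entry(data_layer, index=0, event="None"):
    """Hypothetical helper: pick one entry from a DataLayer-like list."""
    if event == "None":
        # No event name: take the entry at position `index`
        return data_layer[index]
    # Event name given: take the `index`-th entry with that event name
    matches = [
        ob for ob in data_layer
        if isinstance(ob, dict) and ob.get("event") == event
    ]
    return matches[index]

# Made-up DataLayer: page information first, then some events
sample_layer = [
    {"page_type": "home", "page_id": "1", "environment": "prod"},
    {"event": "scroll", "depth": 25},
    {"event": "consent_granted"},
    {"event": "scroll", "depth": 50},
]

# Page information pushed on load (first element)
print(select_entry(sample_layer))
# Second scroll event, so index 1 together with the event name
print(select_entry(sample_layer, index=1, event="scroll"))
```

If fewer matching events exist than the index asks for, this sketch raises an IndexError, which is a useful signal that the expected event never fired.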

To check for missing keys, you can use the following function. All you need to do is define a list of keys that must be present in your DataLayer. The function returns a list of all the keys that are missing.

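def check_datalayer_object(dl_object, required_keys):
    """
    Returns a list of all required keys that are missing in the DataLayer object
    """
    # `dl_object` instead of `object`, to avoid shadowing the Python builtin
    return [key for key in required_keys if key not in dl_object]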
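
Here is a quick usage example with a made-up DataLayer object. The check function is repeated with the same logic so the snippet runs on its own. The second helper, find_empty_keys, is a hypothetical addition that is not part of the original function: it reports keys that are present but hold an empty value, which a pure existence check would not catch.

```python
def check_datalayer_object(dl_object, required_keys):
    """Return the required keys that are missing in the DataLayer object."""
    return [key for key in required_keys if key not in dl_object]

def find_empty_keys(dl_object, required_keys):
    """Hypothetical extension: keys that exist but are None or an empty string."""
    return [
        key for key in required_keys
        if key in dl_object and dl_object[key] in (None, "")
    ]

# Made-up DataLayer object: page_id is empty, environment is missing
sample_object = {"page_type": "article", "page_id": ""}
required = ["page_type", "page_id", "environment"]

print(check_datalayer_object(sample_object, required))  # ['environment']
print(find_empty_keys(sample_object, required))         # ['page_id']
```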

Before we put it all together, we need to add a function that sends an e-mail (e.g. to yourself or someone managing your website). Of course you could use another communication tool, like Slack or Microsoft Teams webhooks. But since e-mail is commonly available, I am sticking to it this time. Here is a generic function that sends an e-mail using the SMTP credentials of your mail server. There are several posts and tutorials on how to get these for almost any provider.

import smtplib
from email.mime.application import MIMEApplication
from email.mime.multipart import MIMEMultipart
from email.header import Header

def send_mail(email_address, subject, file_path = None, filename = "file.csv"):
    """
    Function to send report to target mail address.
    email_address: Can either be a string or a list of e-mail addresses the mail shall be sent to.
    subject: Mail subject
    file_path: Optional, path to e-mail attachment file.
    filename: Name the attachment gets in the e-mail.
    """

    # Constants for SMTP Server
    host = "" # add your host
    port = 25
    sender_mail = "" # add the sender address
    user = "" # add username
    pw = "" # add password

    # Create a MIMEMultipart object
    msg = MIMEMultipart()

    # Set the sender and receiver email addresses
    if isinstance(email_address, str):
        receiver = email_address
    else:
        receiver = ", ".join(email_address)
    msg['From'] = sender_mail
    msg['To'] = receiver

    # Set the subject
    msg['Subject'] = Header(subject, 'utf-8')

    # Attach the file to the email
    if file_path:
        with open(file_path, 'rb') as f:
            attachment = MIMEApplication(f.read())
        # f-string, so the attachment is named after the filename argument
        attachment['Content-Disposition'] = f'attachment; filename="{filename}"'
        msg.attach(attachment)

    # Send the email
    # SMTP(host, port) already connects, and the with-block closes the connection
    with smtplib.SMTP(host, port) as smtp:
        smtp.ehlo()
        smtp.login(user, pw)
        print("smtp login success")
        smtp.sendmail(sender_mail, email_address, msg.as_string())
        print('sending mail succeeded')
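
If you want to inspect the mail before anything is sent, it helps to build the message in a separate step. The sketch below does that with the standard library's email.message.EmailMessage class, which is a different API than the MIMEMultipart approach used above. The helper name build_report_mail, the body text and all addresses are placeholders.

```python
from email.message import EmailMessage

def build_report_mail(sender, receivers, subject, csv_bytes, filename="file.csv"):
    """Hypothetical helper: build (but do not send) a mail with a CSV attachment."""
    if isinstance(receivers, str):
        receivers = [receivers]
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = ", ".join(receivers)
    msg["Subject"] = subject
    msg.set_content("The DataLayer QA report is attached.")
    # Adding an attachment turns the message into multipart/mixed
    msg.add_attachment(
        csv_bytes,
        maintype="application",
        subtype="octet-stream",
        filename=filename,
    )
    return msg

mail = build_report_mail(
    "qa@example.com",
    ["me@example.com", "manager@example.com"],
    "Test CMS datalayer QA",
    b"url,missing_values\nhttps://example.com/,environment\n",
    filename="report.csv",
)
print(mail["To"])
print([part.get_filename() for part in mail.iter_attachments()])
```

A message built this way can be handed to smtplib via smtp.send_message(mail) once the connection and login are set up.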

So let’s put it all together:

First we take all the above functions (except for the one for e-mails) and put them into a functions.py file as a collection of utility functions:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import time, requests, xmltodict

def get_datalayer(url: str, index: int = 0, event: str = "None", navigation_steps: list = None, wait_for_cmp: bool = True, cmp_selector: str = "#cmpwelcomebtncustom"):
    """
    Opens a webpage and returns a specific DataLayer object.
    The object can be selected either by index or by event-name.
    """
    # Set up the Chrome WebDriver
    options = webdriver.ChromeOptions()
    options.add_experimental_option(
        'excludeSwitches',
        ['enable-logging']
    )
    # Pass the options to the driver, otherwise they have no effect
    driver = webdriver.Chrome(options=options)

    try:
        driver.get(url)
        time.sleep(2)
        if wait_for_cmp:
            WebDriverWait(driver, 10).until(EC.visibility_of_element_located((By.CSS_SELECTOR, cmp_selector)))

        # Optional navigation through page
        if navigation_steps is not None:
            for step in navigation_steps:
                sel = step["css-selector"]
                el = WebDriverWait(driver, step["wait"]).until(EC.visibility_of_element_located((By.CSS_SELECTOR, sel)))
                if step["action"] == "click":
                    el.click()
                    print("clicked")
                elif step["action"] == "scroll":
                    # Reading this property scrolls the element into view
                    el.location_once_scrolled_into_view
                    print("scrolled")

        # Wait until the DataLayer exists and has at least one entry
        WebDriverWait(driver, 30).until(
            lambda driver: driver.execute_script("return typeof dataLayer !== 'undefined' && dataLayer.length > 0")
        )
        if event == "None":
            # No event name given: return the entry at position `index`
            return driver.execute_script(f"return window.dataLayer[{index}];")
        # Event name given: return the `index`-th entry with that event name
        dataLayer = driver.execute_script("return window.dataLayer;")
        occ = [ob for ob in dataLayer if isinstance(ob, dict) and ob.get("event") == event]
        return occ[index]
    finally:
        # Always close the browser, even if a wait times out
        driver.quit()

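def get_sitemap(url: str, limit: int | None = None) -> list:
    """
    Returns a list of all links that exist in the sitemap.xml file
    """
    res = requests.get(url, timeout=30)
    res.raise_for_status()  # fail early on HTTP errors
    raw = xmltodict.parse(res.text)
    entries = raw["urlset"]["url"]
    # xmltodict returns a dict instead of a list if there is only one <url>
    if isinstance(entries, dict):
        entries = [entries]
    data = [r["loc"] for r in entries]
    # Optionally cut the list to the first `limit` URLs
    return data[:limit] if limit else data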

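def check_datalayer_object(dl_object, required_keys):
    """
    Returns a list of all required keys that are missing in the DataLayer object
    """
    # `dl_object` instead of `object`, to avoid shadowing the Python builtin
    return [key for key in required_keys if key not in dl_object]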

Then we create a file “send_mail.py” and add the mail function as well as all the modules it needs to it.

And lastly, we create “main.py” to put it all together:

from datetime import datetime
import pandas as pd
from functions import get_datalayer, get_sitemap, check_datalayer_object
from send_mail import send_mail

dl_values = ['page_type', 'page_id', 'environment']
receivers = [] # add list of mail-receivers
sitemap_url = "https://www.amazon.com/sitemap.xml" # add your sitemap URL
date = datetime.now()
date_str = date.strftime('%Y-%m-%d')

def run_qa(sitemap_url, dl_values):
    data = []
    sitemap = get_sitemap(sitemap_url, limit = 10)
    for url in sitemap:
        dl = get_datalayer(url, index = 0, event = "None", navigation_steps = None, wait_for_cmp = True, cmp_selector = "#cmpwelcomebtncustom")
        missing_values = check_datalayer_object(dl, dl_values)
        result = {}
        result['url'] = url
        result['missing_values'] = missing_values
        result['date'] = date_str
        result['qa_values'] = dl_values
        data.append(result)
    df = pd.DataFrame.from_records(data)
    # Keep only the pages that have at least one missing key
    df = df[df['missing_values'].apply(len) > 0]
    df.to_csv('file.csv', sep=',', encoding='utf-8')
    send_mail(receivers, "Test CMS datalayer QA", 'file.csv', 'filename.csv')

if __name__ == "__main__":
    run_qa(sitemap_url, dl_values)

When you run main.py, selenium opens the first 10 pages in your sitemap.xml file (due to the limit argument being 10), checks whether the defined DataLayer keys exist, saves the missing ones to a file and sends it via e-mail.
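
The reporting step can also be sketched without pandas. The snippet below filters a list of result dictionaries (shaped like the ones built in run_qa) down to the pages with missing keys and writes them as CSV using the standard library's csv module. The function name results_to_csv and the sample results are made up for illustration.

```python
import csv
import io

def results_to_csv(results):
    """Hypothetical helper: return CSV text for all pages with missing keys."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=["url", "missing_values", "date"],
        lineterminator="\n",
    )
    writer.writeheader()
    for row in results:
        if row["missing_values"]:  # skip pages where nothing is missing
            writer.writerow({
                "url": row["url"],
                "missing_values": ";".join(row["missing_values"]),
                "date": row["date"],
            })
    return buffer.getvalue()

# Made-up results: the first page is complete, the second one is not
sample_results = [
    {"url": "https://example.com/", "missing_values": [], "date": "2024-01-01"},
    {"url": "https://example.com/about", "missing_values": ["page_id", "environment"], "date": "2024-01-01"},
]

print(results_to_csv(sample_results))
```

Joining the missing keys with a semicolon keeps each page on one row and avoids quoting issues with the comma separator.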

I hope this helps you in your journey to automating DataLayer QAs!

Feel free to check out the GitHub repo with all files: https://github.com/ramonseradj/static_cms_qa_public