The Problem with Building Test Automation Around Limited AI Coding Sessions

When a team first tries to use AI to build automated tests, the workflow often feels almost unfairly productive. You describe a flow, the assistant writes the script, you paste it into your repo, and a browser test appears faster than a human could have scaffolded it. That early speed is real. The problem is that it can hide a structural mismatch between how modern test suites grow and how limited AI coding sessions work.

Once the suite expands, the work stops being about generating one more test file. It becomes about reading old tests, understanding shared utilities, debugging flaky selectors, tracing CI failures, and making changes across a system that now has history. That is exactly the point where limited AI coding sessions start to break down as a foundation for test automation. The team needs continuity, memory, and repeatable maintenance behavior, but the workflow is built around short-lived bursts of context.

This is not an argument against AI-assisted development. It is an argument against treating AI coding sessions as the system of record for test automation.

What the workflow looks like at the beginning

In the first phase, AI coding assistants can look perfect for browser automation. A developer or QA engineer provides a user story, perhaps something like:

open the login page
enter credentials
verify the dashboard loads
assert the account name is visible

A session-based coding assistant can often generate a Playwright or Selenium test quickly. If you are using Playwright, the resulting script may even look polished enough to commit with minimal edits. If you are on Selenium, the assistant can produce the basic driver interactions, waits, and assertions in a few minutes.

That is where teams can get overconfident. The first few tests are usually simple, local, and highly contextual. The page structure is familiar, the selectors are easy, and the generated code only needs to satisfy a narrow slice of the application.

The hidden assumption is that the same workflow will still work when the suite becomes a real asset instead of a demo.

Why limited sessions become fragile as the suite grows

The phrase “limited AI coding sessions” sounds like an inconvenience, but in practice it creates a maintenance model with hard edges. As the codebase grows, each change requires more context than a fresh session can comfortably hold.

1. The assistant does not inherit your test system mentally

A browser test suite is not just a collection of scripts. It is an accumulation of conventions, helpers, fixtures, environment variables, page abstractions, locator strategies, retry policies, and CI behavior. Even a tidy Playwright setup may include:

custom fixtures
authentication state reuse
shared page objects
environment-specific data setup
flaky test workarounds
special handling for shadow DOM, iframes, or dynamic lists

A limited session can inspect part of this structure, but it often cannot reason about all of it at once. Once the suite reaches a few dozen tests, changes ripple. A selector fix in one place may reveal that three other tests were silently relying on the old flow. A new abstraction may conflict with existing helper patterns. A fresh session can miss those relationships because they are distributed across files and history.

2. Debugging is stateful, but the session is not

The biggest maintenance cost in browser automation is often not writing the first version of the test. It is figuring out why a test failed after a UI change, an environment change, or a timing issue.

A limited AI coding session usually sees one failure at a time. But a real test failure often needs layered reasoning:

Is the failure due to a locator change?
Did the app render slowly in CI but not locally?
Did a prior step leave the app in the wrong state?
Is the failure caused by test data pollution?
Was the failure introduced by a recent helper refactor?

That is where the limits of AI coding assistants become operational, not just theoretical. The session has to repeatedly reconstruct context. If the assistant cannot hold enough of the system in working memory, the human ends up acting as the long-term memory for the tool.

The cost is not only the time spent prompting the assistant. The cost is the human time spent rebuilding context that the tooling should have preserved.
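One way to stop rebuilding that context in every session is to write the layered questions down as an explicit checklist. The sketch below is a hypothetical triage helper: the `FailureReport` fields, the category names, and the ordering of the rules are all illustrative assumptions, not part of Playwright, Selenium, or any other tool.

```typescript
// Hypothetical triage helper. Field names, categories, and rules are assumptions.
type FailureCategory =
  | "helper-regression"
  | "state-or-data"
  | "ci-timing"
  | "locator-change"
  | "unknown";

interface FailureReport {
  errorMessage: string;           // raw error text from the test runner
  passesLocally: boolean;         // does the same test pass on a developer machine?
  passesInIsolation: boolean;     // does it pass when run alone, outside the full suite?
  helperChangedRecently: boolean; // was a shared helper edited since the last green run?
}

function triage(report: FailureReport): FailureCategory {
  const message = report.errorMessage.toLowerCase();

  // Checked first: a refactored helper can produce any of the symptoms below.
  if (report.helperChangedRecently) return "helper-regression";

  // Fails in the suite but passes alone: suspect leftover state or polluted data.
  if (report.passesInIsolation) return "state-or-data";

  if (message.includes("timeout")) {
    // A timeout that only happens in CI suggests slow rendering, not a changed element.
    return report.passesLocally ? "ci-timing" : "locator-change";
  }

  if (message.includes("not found") || message.includes("no such element")) {
    return "locator-change";
  }

  return "unknown";
}
```

The value is less in the specific rules than in having them written down once, so the same questions get asked in the same order regardless of which person or session picks up the failure.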

3. Maintenance work is not local to a single test

Small suites are deceptive because a single failing script can often be fixed in isolation. Large suites are different. A new element ID strategy, a component library upgrade, or a route rename can affect many tests at once. This is where maintaining the test framework becomes more important than generating tests.

If the suite is written in code, every change may require:

re-checking shared locators
updating page object methods
re-evaluating waits and retries
reviewing fixture assumptions
adjusting CI timeouts
re-running the entire shard or suite

That is a lot to ask from a workflow built around short sessions, especially when the same assistant is also expected to generate new tests, fix old ones, explain failures, and preserve style consistency.

The scaling problem is architectural, not just ergonomic

A limited-session AI workflow is appealing because it feels like a better editor. But test automation is not just editing. It is software operation.

If the automation layer is code-first, every test is part of a code maintenance surface. That means the team is implicitly choosing a framework-centered architecture. In that model, AI is a productivity layer on top of a codebase. It does not remove the need to understand the framework, manage the abstractions, or debug the runtime behavior.

That is fine if the organization wants to own that complexity. It is not fine if the team believes AI sessions will absorb it for them.

The risk shows up in predictable ways:

new hires struggle to understand the generated patterns
different engineers generate slightly different styles
locators drift and accumulate fixes
review quality drops because the test code is dense and repetitive
the suite becomes harder to trust, so teams stop using it as a release gate

This is the real danger of building test automation around limited AI coding sessions. The workflow optimizes initial creation, but the organization pays the long-term cost in comprehension, consistency, and maintenance.

Playwright and Selenium are not the problem; the ownership model is

It is easy to blame the framework. People complain about Playwright maintenance or Selenium flakiness, and those complaints are often justified. But the framework is usually only exposing the deeper issue: who owns the maintenance burden, and how is that burden reduced over time?

Playwright is strong for modern browser automation, especially when teams want expressive code and close control over the browser runtime. Selenium remains widely used because of its ecosystem and historical footprint. Both can support serious test programs.

The problem is not that code-based frameworks are incapable. The problem is that teams often pair them with a session-based authoring model and assume the authoring layer will keep up with the evolving codebase.

It will not, at least not reliably.

Once the suite is large enough, the work becomes less about writing test logic and more about sustaining a living automation system. That system needs reproducibility, stable abstractions, and a way to absorb UI churn without forcing each new change back through a fragile coding session.

Common failure modes when the suite grows

Locators become the maintenance tax

Most browser test breakage starts with locators. The CSS class changed, the button text changed, the DOM got restructured, or the test was relying on an implicit ordering that no longer exists. In a code-first workflow, the assistant may rewrite the selector, but the real question is whether the selector strategy is robust enough to survive the next change.

A one-off session can patch a broken locator. It cannot, by itself, enforce a locator philosophy across the whole suite.
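A locator philosophy only holds if something checks it. As a minimal sketch, assuming selectors are kept in one registry, a small audit can flag the patterns that tend to break on the next UI change. The `LOCATORS` registry, the `data-testid` convention, and the heuristics themselves are illustrative assumptions, not rules taken from any framework.

```typescript
// Illustrative heuristics only. The patterns and the registry are assumptions.
const BRITTLE_PATTERNS: { pattern: RegExp; reason: string }[] = [
  { pattern: /:nth-child\(|:nth-of-type\(/, reason: "depends on element order" },
  { pattern: /^\/\/|^\/html/, reason: "structural XPath" },
  { pattern: /(\.[\w-]+){3,}/, reason: "long chain of CSS classes" },
  { pattern: /\s>\s.*\s>\s/, reason: "deep child-combinator path" },
];

// Returns the reasons a selector looks fragile, or an empty array if none match.
function brittleReasons(selector: string): string[] {
  return BRITTLE_PATTERNS
    .filter(({ pattern }) => pattern.test(selector))
    .map(({ reason }) => reason);
}

// Hypothetical registry: one reviewable place for every selector in the suite.
const LOCATORS: Record<string, string> = {
  buyNow: '[data-testid="buy-now"]',
  payButton: "div.checkout.panel.footer > div > button:nth-child(2)",
};

// Maps each flagged locator name to the reasons it was flagged.
function auditLocators(locators: Record<string, string>): Record<string, string[]> {
  const findings: Record<string, string[]> = {};
  for (const name of Object.keys(locators)) {
    const reasons = brittleReasons(locators[name]);
    if (reasons.length > 0) findings[name] = reasons;
  }
  return findings;
}
```

Running `auditLocators(LOCATORS)` flags only `payButton`. A check like this could run in CI, so the policy is enforced by the suite rather than remembered by whoever prompted the last fix.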

Waits become folklore

Teams often accumulate timing fixes like cargo-cult rituals: add a wait here, retry there, increase a timeout somewhere else. These patches make the suite pass today, but they hide the underlying instability. Limited AI sessions can help insert waits, but they are poor at deciding when a wait is a symptom and when it is a real synchronization requirement.
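One way to make that distinction visible is to wait on a named condition instead of a fixed delay. Playwright's web-first assertions such as `expect(locator).toBeVisible()` already retry until a timeout; the helper below is a generic sketch of the same idea for conditions that live outside the browser. The function name, option names, and default values are assumptions chosen for illustration.

```typescript
// Minimal polling helper. Names and defaults are assumptions for illustration.
async function waitFor(
  condition: () => boolean | Promise<boolean>,
  options: { timeoutMs?: number; intervalMs?: number } = {}
): Promise<void> {
  const timeoutMs = options.timeoutMs ?? 5000;
  const intervalMs = options.intervalMs ?? 50;
  const deadline = Date.now() + timeoutMs;

  while (true) {
    // Succeed as soon as the condition holds, rather than sleeping a fixed time.
    if (await condition()) return;

    // Fail loudly with the budget that was exceeded, so the wait is explainable.
    if (Date.now() >= deadline) {
      throw new Error(`Condition not met within ${timeoutMs} ms`);
    }

    await new Promise<void>((resolve) => setTimeout(resolve, intervalMs));
  }
}
```

A call such as `await waitFor(() => orderStatus() === "paid", { timeoutMs: 10000 })`, where `orderStatus` is a hypothetical helper, states what the test is actually synchronizing on. A bare sleep states nothing, which is how waits turn into folklore.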

Test data and state drift

A browser test is rarely just browser automation. It is also a dependency chain into data setup, backend state, feature flags, email systems, and test accounts. As the suite grows, the probability of hidden coupling increases. A limited coding session may not see all of these dependencies, which means the fix may solve the visible symptom while leaving the root cause intact.

Refactors are expensive because the suite is code

A change to one shared helper can affect many tests. This is standard software engineering, but in testing it can be especially painful because the failure surface is broad and the business impact is often immediate. If the team is already running into the limits of its AI coding assistant, the refactor becomes a bottleneck right when the suite needs to evolve.
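The blast radius of a helper change can be computed rather than guessed. The sketch below assumes a hand-written import map; the file names and the map structure are made up for illustration, and a real suite would derive the map from its build tooling.

```typescript
// Hypothetical import map: file names and structure are invented for illustration.
type ImportMap = Record<string, string[]>; // file -> files it imports

const importMap: ImportMap = {
  "tests/login.spec.ts": ["pages/LoginPage.ts"],
  "tests/checkout.spec.ts": ["pages/CheckoutPage.ts", "helpers/auth.ts"],
  "tests/profile.spec.ts": ["helpers/auth.ts"],
  "pages/LoginPage.ts": ["helpers/auth.ts"],
  "pages/CheckoutPage.ts": [],
  "helpers/auth.ts": [],
};

// Returns every spec file that depends, directly or transitively, on `changed`.
function affectedSpecs(map: ImportMap, changed: string): string[] {
  const dependsOn = (file: string, seen: Set<string>): boolean => {
    if (file === changed) return true;
    if (seen.has(file)) return false; // already explored, or part of a cycle
    seen.add(file);
    return (map[file] ?? []).some((dep) => dependsOn(dep, seen));
  };

  return Object.keys(map)
    .filter((file) => file.endsWith(".spec.ts"))
    .filter((file) => dependsOn(file, new Set<string>()))
    .sort();
}
```

With this map, `affectedSpecs(importMap, "helpers/auth.ts")` returns all three spec files, including `tests/login.spec.ts`, which never imports the helper directly. That indirect dependency is exactly the kind of relationship a session looking at a single file is likely to miss.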

What CTOs, QA leaders, and founders should actually optimize for

The goal is not to maximize code generation. The goal is to maximize reliable coverage per unit of maintenance.

That means asking harder questions:

How much effort does a new test add to the long-term maintenance pool?
What happens when the UI changes in ten places at once?
Can non-developers meaningfully author or update tests?
Is the suite resilient enough to survive routine frontend iteration?
How much of the system depends on remembering framework-specific patterns?

If the answer to those questions is “the assistant will probably handle it,” that is not a maintenance plan.

A platform approach reduces the failure surface

This is where a platform model is materially different from a code-generation loop. Instead of treating test creation as a temporary coding session, the system becomes the place where the test lives, is edited, and is maintained.

That is the core reason we think Endtest is a stronger long-term approach for many teams. Endtest uses agentic AI to create tests inside an editable platform, so the output is not a pile of generated code that has to be reinterpreted later. The test becomes a standard, inspectable, platform-native asset.

That distinction matters because it changes where complexity lives.

In a code-first workflow:

the test lives in source code
maintenance requires framework knowledge
failures often require debugging both the app and the test code
each repair may depend on a fresh AI session understanding the full context

In a platform workflow like Endtest:

tests are created as editable steps in the platform
the authoring surface is shared by the team
maintenance is centralized in the testing system
the workflow is less exposed to AI coding session limits

This does not eliminate testing complexity, but it relocates it into a system designed for test authoring and execution rather than general-purpose coding.

Why editable platform-native tests age better than generated scripts

The main reason platform-native tests are easier to sustain is that they are closer to the intent of the test, and farther from the mechanics of the browser driver.

When a generated Playwright or Selenium script fails, a human often has to translate between the bug, the framework code, and the app behavior. That translation is where limited sessions struggle. When the test is editable in a platform, the maintenance task becomes more direct: inspect the steps, update the assertion, adjust the locator, rerun the flow.

Endtest’s self-healing tests add another useful layer here. If a locator stops matching because the UI changed, the platform can search surrounding context and recover the run, instead of immediately turning CI red. For teams dealing with evolving frontends, that is a practical way to reduce maintenance noise without asking an AI coding session to reconstruct the whole suite every time the DOM shifts.

The important question is not whether a tool can create a test quickly. It is whether the same tool can help the test survive the next six UI changes.

A short example of the maintenance gap

Imagine a checkout test written in Playwright.

import { test, expect } from '@playwright/test';

test('checkout flow', async ({ page }) => {
  // Start from the storefront and open the checkout via the CTA button.
  await page.goto('https://example.com');
  await page.getByRole('button', { name: 'Buy now' }).click();

  // Fill in the payment form and submit it.
  await page.getByLabel('Email').fill('qa@example.com');
  await page.getByRole('button', { name: 'Pay' }).click();

  // The confirmation text is the only assertion in this test.
  await expect(page.getByText('Order confirmed')).toBeVisible();
});

This is fine as a starting point. But six weeks later, the app has changed:

the CTA label became “Start checkout”
the payment form is inside a modal
a retry step was added after 3DS verification
the success message text changed slightly

A limited AI coding session can probably patch the file, but now it must also understand whether the test should be using a new page object, whether the label change is temporary, and whether other tests depend on the same journey. That is not a small ask if the assistant only has a partial view.
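To make the page-object question concrete, here is a minimal sketch of what moving the checkout journey behind one class could look like. The `PageLike` and `LocatorLike` interfaces are structural stand-ins written for this example so it runs without Playwright installed; they are assumptions, not Playwright's own types, and `CheckoutPage` is a hypothetical class.

```typescript
// Stand-ins for the small slice of a Playwright-style page API used here.
// These interfaces are assumptions for the sketch, not Playwright's own types.
interface LocatorLike {
  click(): Promise<void>;
  fill(value: string): Promise<void>;
}

interface PageLike {
  getByRole(role: "button", options: { name: string }): LocatorLike;
  getByLabel(text: string): LocatorLike;
}

// Hypothetical page object: UI labels live in one place instead of in every test.
class CheckoutPage {
  static readonly labels = {
    start: "Start checkout", // was "Buy now" before the UI change
    email: "Email",
    pay: "Pay",
  };

  private readonly page: PageLike;

  constructor(page: PageLike) {
    this.page = page;
  }

  // Opens the checkout via the call-to-action button.
  async start(): Promise<void> {
    await this.page.getByRole("button", { name: CheckoutPage.labels.start }).click();
  }

  // Fills the email field and submits the payment form.
  async payAs(email: string): Promise<void> {
    await this.page.getByLabel(CheckoutPage.labels.email).fill(email);
    await this.page.getByRole("button", { name: CheckoutPage.labels.pay }).click();
  }
}
```

With this shape, the label change is a one-line edit in `labels`, and every test that calls `start()` picks it up. Deciding to introduce the class still requires knowing which other tests share the journey, and that knowledge is what a partial view lacks.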

In a platform-oriented system, the test update is often more direct. The workflow is to adjust the test steps, review the new locator or assertion, and keep the test inside the same editing environment that will execute it.

When code-first automation still makes sense

This is not an argument to abandon Playwright or Selenium entirely. There are cases where code-first automation is the right choice:

complex custom test logic
heavy API and browser orchestration
deep integration with engineering tooling
teams with strong framework ownership and disciplined maintenance budgets
specialized scenarios that need fine-grained control

If your organization already has mature test infrastructure, great coding discipline, and enough engineering capacity to own the maintenance curve, code-first may be the right tradeoff.

But if the team is repeatedly hitting the limits of its AI coding assistant, spending too much time regenerating or repairing tests, or relying on a small number of people who understand the suite, that is a signal. The problem may not be the AI. The problem may be the architecture.

Decision criteria for leaders

Use these questions to decide whether the current approach is sustainable:

Does every new test require the same small group of people to understand the framework?
Are failures often rooted in selectors, waits, or hidden state?
Do AI-assisted fixes feel helpful once, but expensive on the second and third revision?
Is your suite becoming harder to explain to non-specialists?
Would a UI change force multiple code edits across files?
Do you trust the test authoring process more than the underlying suite stability?

If several of those are true, the team is probably using a coding session as a long-term testing system. That is a mismatch.

Why platform-centered AI is the more reliable bet

The strongest argument for a platform like Endtest is not that it removes all maintenance. It is that it reduces the number of places where maintenance can go wrong.

With Endtest’s AI Test Creation Agent, a team can describe behavior in plain English and get a working test that lands as editable steps in the platform. That creates a shared authoring surface for QA, developers, PMs, and designers, without depending on someone to keep a fragile codebase and a limited AI session in sync.

For teams migrating from older code-based suites, Endtest also supports Selenium migration paths, which matters because many organizations do not want a risky rewrite. They want a practical exit from an increasingly brittle maintenance model.

This is why we think the platform approach is more reliable for organizations that care about continuity. It moves the work from “generate code, debug code, patch code, repeat” into a system where the test itself is the maintained artifact.

The bottom line

Limited AI coding sessions are useful, but they are a poor foundation for a test automation program that is expected to grow, change, and stay trustworthy.

Early on, they make you feel fast. Later, they make you pay for context loss, maintenance drift, and repeated debugging at the exact moment the suite needs momentum. That is the trap. The more the framework grows, the more each change requires context, reasoning, and follow-through, which means you are most likely to hit the limits when the work matters most.

If your team wants code-level control and is prepared to own the maintenance curve, Playwright or Selenium with AI assistance can still be a valid choice. But if the priority is a durable, team-friendly automation system with less dependence on fragile coding sessions, a platform-based approach is usually the better long-term bet.

That is the practical reason to treat limited AI coding sessions as a supplement, not the core of your testing strategy.