AI Test Data Generation for Dynamic Forms: What We Tried, What Broke, and What Helped

Dynamic forms look simple until you try to test them at scale. A single checkout or onboarding screen can contain conditional fields, masked inputs, locale-specific labels, server-side validation, async lookups, and business rules that only appear after three prior answers. That combination is exactly where static fixtures and hand-authored test data start to fall apart.

This lab notebook captures what we learned while experimenting with AI test data generation for dynamic forms. The goal was not to chase a shiny model, but to answer a practical question: can AI help us generate realistic test data for forms with conditional fields and edge-case validation without creating a maintenance burden of its own?

Short answer: yes, but only if you treat AI as a data assistant, not as an oracle. The biggest wins came when we constrained the output format, separated data generation from UI interaction, and kept the test steps editable. The biggest failures came from letting opaque workflows invent too much state, too many values, and too many assumptions about how the form behaved.

Why dynamic forms are a test data problem first

When people talk about dynamic form testing, they usually focus on locators, waits, or flaky UI transitions. Those matter, but many failures are really test data failures in disguise.

A dynamic form can branch based on:

country or locale
user role or subscription tier
age, income, or other eligibility inputs
prior selections in a wizard
feature flags
server responses, such as suggested addresses or validation lookups
formatting rules, such as postal codes, phone numbers, or tax IDs

That means the test itself needs data that can trigger a branch, satisfy validation, and remain readable to the person maintaining the suite. If the data is wrong, you do not learn whether the form logic is correct. You only learn that your fixture was invalid.

This is where test data generation for QA starts to matter more than it usually does. For basic CRUD tests, you can often get by with a few hardcoded records. For dynamic forms, especially those under localization pressure, you need a repeatable way to generate data that is valid enough to reach the branch you care about, but weird enough to expose failures.

The most useful test data is not the most realistic data possible; it is the data that reliably drives the exact branch you need to verify.

What we tried first

We started with three approaches.

1) Handwritten fixtures

This is the oldest pattern, and still the easiest to understand.

We kept JSON files for user profiles, addresses, and payment-like fields, then mapped them into form inputs in the test code.

{
  "country": "DE",
  "postalCode": "10115",
  "phone": "+49 30 123456",
  "newsletter": true,
  "preferredContact": "email"
}

This worked well for stable branches, especially when the form only needed one or two values to flip a conditional section on or off. It failed when the form had multiple dependent fields, because the combinatorial explosion was immediate. If one field changed, several fixtures needed updates. If locale logic changed, the same fixture might become invalid in one environment and valid in another.

The maintenance cost was not catastrophic, but it was constant.
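To make that combinatorial pressure concrete, here is a minimal sketch that counts how many fixtures a fully enumerated matrix would need. The dimensions and their values are hypothetical and chosen only for illustration; they are not the fields of the form we tested.

```typescript
// Hypothetical form dimensions; names and values are illustrative assumptions.
const fixtureDimensions: { [dimension: string]: string[] } = {
  country: ["DE", "FR", "ES", "CA"],
  accountType: ["individual", "freelancer", "company"],
  preferredContact: ["email", "phone"],
  newsletter: ["true", "false"],
};

// Build every combination of values (the Cartesian product of all dimensions).
function enumerateFixtures(
  dimensions: { [dimension: string]: string[] }
): Array<{ [dimension: string]: string }> {
  let combos: Array<{ [dimension: string]: string }> = [{}];
  for (const key of Object.keys(dimensions)) {
    const values = dimensions[key] ?? [];
    const next: Array<{ [dimension: string]: string }> = [];
    for (const partial of combos) {
      for (const value of values) {
        next.push({ ...partial, [key]: value });
      }
    }
    combos = next;
  }
  return combos;
}

const allFixtures = enumerateFixtures(fixtureDimensions);
console.log(allFixtures.length); // 4 * 3 * 2 * 2 = 48
```

Adding one more two-valued field doubles the count to 96, which is why hand-maintaining one file per combination stops scaling quickly.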

2) Fuzzing with random values

We then used simple generators for names, emails, dates, phone numbers, and addresses. This helped surface validation gaps, but it was not enough for dynamic forms with business rules.

Random generation is fine when the field only needs to match a type. It is much less useful when the form expects coherent combinations, such as:

date of birth and age gate
country and postal code format
state and tax jurisdiction
dependent fields that are only required after a prior choice
localized names that should pass character validation

Random data also created noisy failures. We found ourselves debugging the generator instead of the form.
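The coherence problem can be sketched as follows. This is an illustration, not our actual generator: the sample postal codes, the `RandomSource` type, and all function names are assumptions. The naive version draws each field independently, while the coherent version lets the country constrain the postal code.

```typescript
type RandomSource = () => number; // returns a value in [0, 1), like Math.random

// Illustrative sample values per country; not an exhaustive or authoritative list.
const postalSamples: { [country: string]: string[] } = {
  DE: ["10115", "80331"],
  FR: ["75008", "69001"],
  CA: ["K1A 0B1", "M5V 3L9"],
};
const fuzzCountries = Object.keys(postalSamples);

function pickOne<T>(items: T[], rng: RandomSource): T {
  return items[Math.floor(rng() * items.length)] as T;
}

// Naive: country and postal code are drawn independently, so they may not match.
function independentRecord(rng: RandomSource): { country: string; postalCode: string } {
  const country = pickOne(fuzzCountries, rng);
  const postalSource = pickOne(fuzzCountries, rng);
  const postalCode = pickOne(postalSamples[postalSource] ?? [], rng);
  return { country, postalCode };
}

// Coherent: the country is drawn first and constrains the postal code.
function coherentRecord(rng: RandomSource): { country: string; postalCode: string } {
  const country = pickOne(fuzzCountries, rng);
  const postalCode = pickOne(postalSamples[country] ?? [], rng);
  return { country, postalCode };
}

// A record is coherent here if the postal code belongs to the chosen country's samples.
function isCoherent(rec: { country: string; postalCode: string }): boolean {
  return (postalSamples[rec.country] ?? []).indexOf(rec.postalCode) !== -1;
}

const fuzzSample = independentRecord(Math.random);
console.log(fuzzSample, isCoherent(fuzzSample)); // result varies from run to run
```

Passing the random source as a parameter is a deliberate choice: a test can inject a scripted sequence and get the same record on every run.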

3) AI-generated records

The most interesting results came when we asked an AI system to produce structured test records from a form schema and a scenario prompt, for example:

“Generate a valid German onboarding record for a freelancer”
“Generate an invalid record that should fail postal code validation in France”
“Generate a record that triggers the VAT ID field and leaves it empty”

This was the first approach that consistently gave us branch-aware data without hand-authoring every case. It was also the first approach that created new categories of failure.

What broke with AI-generated test data

AI-generated test data can be very good at looking plausible. That is also the trap.

1) Plausible but invalid combinations

A model can easily generate values that look correct in isolation but fail as a set. Examples included:

a province that does not exist for the selected country
a postal code format that is valid for one region but not the chosen locale
a phone number that has the right length but the wrong country prefix
a birthdate that makes the user too young for the selected product

These are not model failures so much as context failures. The form rules depend on relationships, not just fields.

2) Overfitting to demo-friendly data

Some generated records looked polished but were too generic. If every AI-generated user is a 32-year-old marketing manager in a major city, you are not actually testing much diversity.

Dynamic forms often fail on edge cases, not happy paths. We needed data that could intentionally probe weirdness:

long names with apostrophes or compound surnames
diacritics and non-Latin scripts
addresses with apartment numbers, building names, or unusual ordering
blank optional fields next to required dependent fields
ambiguous dates, such as 01/02/2024 in locale-sensitive forms
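One way to probe these cases deliberately is to keep a small pool of awkward but legitimate values and apply them to a known-good record one field at a time. The sketch below assumes a flat string record; the pool contents, field names, and the `edgeVariants` helper are hypothetical.

```typescript
type FlatRecord = { [field: string]: string };

// Hypothetical pool of awkward-but-legitimate values, keyed by field name.
const edgeValuePool: { [field: string]: string[] } = {
  fullName: ["Siobhán O'Connor-García", "李小龍", "Anne-Marie de la Fontaine du Bois"],
  addressLine: ["Flat 4B, Rose Cottage, 12 Mill Lane", "3-2-1 Marunouchi, Chiyoda-ku"],
  birthDate: ["01/02/2024"], // ambiguous between day-first and month-first locales
};

interface EdgeVariant {
  changedField: string;
  record: FlatRecord;
}

// Produce one variant per edge value, changing exactly one field of the base record
// so that a failure can be attributed to that single value.
function edgeVariants(base: FlatRecord, pool: { [field: string]: string[] }): EdgeVariant[] {
  const variants: EdgeVariant[] = [];
  for (const field of Object.keys(pool)) {
    for (const value of pool[field] ?? []) {
      variants.push({ changedField: field, record: { ...base, [field]: value } });
    }
  }
  return variants;
}

const baseRecord: FlatRecord = {
  fullName: "Anna Schmidt",
  addressLine: "Hauptstrasse 1",
  birthDate: "1990-05-17",
  country: "DE",
};

const edgeCases = edgeVariants(baseRecord, edgeValuePool);
console.log(edgeCases.length); // 3 + 2 + 1 = 6
```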

3) Hidden assumptions in prompts

We learned quickly that prompt wording matters more than we wanted it to. If the prompt says “generate a realistic user,” the model will optimize for realism, not for branch coverage. If the prompt says “generate an edge case,” it may create data that fails too early and never reaches the target section.

For dynamic form testing, the prompt should specify:

target branch or validation rule
required locale
acceptable and unacceptable values
whether the record should be valid, invalid, or borderline
any downstream expectations, such as which section should appear
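A scenario definition with those elements can be captured as a typed object and rendered into a prompt. The following is a minimal sketch: the `ScenarioSpec` shape, its field names, and the prompt wording are assumptions, not a format that any particular model requires.

```typescript
type ScenarioValidity = "valid" | "invalid" | "borderline";

// Hypothetical scenario shape mirroring the checklist above.
interface ScenarioSpec {
  targetBranch: string;       // branch or validation rule under test
  locale: string;             // required locale, e.g. "fr-FR"
  validity: ScenarioValidity; // whether the record should pass validation
  mustSatisfy: string[];      // constraints on acceptable values
  mustAvoid: string[];        // values the record must not contain
  expectedSection: string;    // downstream expectation: which section should appear
}

// Render the spec into a prompt, one constraint per line so nothing is implied.
function buildScenarioPrompt(spec: ScenarioSpec): string {
  return [
    `Generate one ${spec.validity} form record as JSON.`,
    `Locale: ${spec.locale}`,
    `Target branch or rule: ${spec.targetBranch}`,
    `Constraints to satisfy: ${spec.mustSatisfy.join("; ") || "none"}`,
    `Values to avoid: ${spec.mustAvoid.join("; ") || "none"}`,
    `Expected visible section: ${spec.expectedSection}`,
    "Return only the JSON object, with no commentary.",
  ].join("\n");
}

const frenchPostalScenario: ScenarioSpec = {
  targetBranch: "postal code validation",
  locale: "fr-FR",
  validity: "invalid",
  mustSatisfy: ["postal code is too short", "all other fields are valid"],
  mustAvoid: [],
  expectedSection: "address",
};

console.log(buildScenarioPrompt(frenchPostalScenario));
```

Keeping the spec as data means the same object can later drive the assertion, so the prompt and the expected outcome cannot drift apart silently.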

4) Poor traceability

The more opaque the workflow, the harder it was to know why a test failed.

If a test run included both AI-generated input and AI-driven interaction logic, a failure could come from any layer:

the data was invalid
the form UI changed
the validation message changed
the locator was brittle
the branching logic did not match the prompt

That is too many variables for a suite that needs to be debugged under CI pressure.

What helped most

The most useful improvements were not fancy. They were boring in the best possible way.

1) Constrain the output schema

We stopped asking for free-form records and started asking for structured payloads with explicit field names and value constraints.

For example, a generator prompt should produce data like this:

{
  "locale": "fr-FR",
  "country": "FR",
  "postalCode": "75008",
  "phone": "+33 1 42 68 53 00",
  "taxId": "",
  "expectedBranch": "tax_id_required"
}

That made the generated data easier to validate before it entered the UI test. It also made failures easier to inspect in logs.
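One way to enforce that shape is to parse the raw model output against a fixed field list and reject anything missing or extra. This is a sketch under the assumption that every field in the example payload is a string; `parseGeneratedRecord` and the type names are hypothetical.

```typescript
// The six fields from the example payload; all are assumed to be strings.
const recordFields = ["locale", "country", "postalCode", "phone", "taxId", "expectedBranch"] as const;
type RecordField = (typeof recordFields)[number];
type GeneratedRecord = { [K in RecordField]: string };

type ParseOutcome =
  | { ok: true; record: GeneratedRecord }
  | { ok: false; errors: string[] };

// Parse raw model output and accept it only if it has exactly the expected string fields.
function parseGeneratedRecord(raw: string): ParseOutcome {
  let parsed: unknown;
  try {
    parsed = JSON.parse(raw);
  } catch {
    return { ok: false, errors: ["output is not valid JSON"] };
  }
  if (typeof parsed !== "object" || parsed === null || Array.isArray(parsed)) {
    return { ok: false, errors: ["output is not a JSON object"] };
  }
  const candidate = parsed as { [key: string]: unknown };
  const errors: string[] = [];
  const record: Partial<GeneratedRecord> = {};
  for (const field of recordFields) {
    const value = candidate[field];
    if (typeof value === "string") record[field] = value;
    else errors.push(`field "${field}" is missing or not a string`);
  }
  for (const key of Object.keys(candidate)) {
    if ((recordFields as readonly string[]).indexOf(key) === -1) {
      errors.push(`unexpected field "${key}"`);
    }
  }
  if (errors.length > 0) return { ok: false, errors };
  return { ok: true, record: record as GeneratedRecord };
}

const rawModelOutput =
  '{"locale":"fr-FR","country":"FR","postalCode":"75008","phone":"+33 1 42 68 53 00","taxId":"","expectedBranch":"tax_id_required"}';
console.log(parseGeneratedRecord(rawModelOutput).ok); // true
```

Rejecting unexpected fields is stricter than necessary for some teams, but it stops a model from quietly adding state that the test never asked for.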

2) Split test data generation from execution

This was the biggest design improvement.

Instead of letting AI generate data while also controlling the browser, we generated data first, stored it as a test artifact, and then fed that artifact into a normal test flow.

That separation gave us:

deterministic reruns
easier review in pull requests
simpler debugging
the ability to reuse the same record across multiple tests

If a record is bad, reject it before it reaches the browser.
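The artifact step can be sketched as a pair of pure functions: one that serializes a record deterministically, and one that loads it back for the test run. The `DataArtifact` shape and function names are assumptions; where the text ends up (a file in the repository, a CI artifact) is left to the caller.

```typescript
// Hypothetical artifact shape: the generated record plus enough context to trace it.
interface DataArtifact {
  scenarioId: string;
  generatedAt: string; // ISO timestamp supplied by the caller
  record: { [field: string]: string };
}

// Serialize with sorted record keys so the same data always yields the same text,
// which keeps pull request diffs limited to values that actually changed.
function serializeArtifact(artifact: DataArtifact): string {
  const sortedRecord: { [field: string]: string } = {};
  for (const key of Object.keys(artifact.record).sort()) {
    sortedRecord[key] = artifact.record[key] ?? "";
  }
  return JSON.stringify(
    { scenarioId: artifact.scenarioId, generatedAt: artifact.generatedAt, record: sortedRecord },
    null,
    2
  );
}

// Load an artifact back and fail loudly if the basic shape is wrong.
function loadArtifact(text: string): DataArtifact {
  const parsed = JSON.parse(text) as DataArtifact;
  if (typeof parsed.scenarioId !== "string" || typeof parsed.record !== "object" || parsed.record === null) {
    throw new Error("artifact is missing scenarioId or record");
  }
  return parsed;
}

const storedText = serializeArtifact({
  scenarioId: "fr-tax-id-empty",
  generatedAt: "2024-01-01T00:00:00.000Z",
  record: { postalCode: "75008", country: "FR", taxId: "" },
});
console.log(loadArtifact(storedText).record["postalCode"]); // "75008"
```

Because the test reads the stored text rather than calling the model, a rerun uses exactly the same record as the failing run.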

3) Add validation on the generated data itself

We created a lightweight validation layer that checked the generated payload against known constraints before any UI action ran.

Examples:

country and postal code must match known regex patterns
required dependent fields must be present when a branch is selected
localized values must use allowed character sets
“invalid” scenarios must fail for the right reason, not just any reason

This can be done with JSON Schema, custom validators, or domain-specific checks. The point is to catch nonsense early.
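As an illustration of the custom-validator route, the sketch below checks a record against the four kinds of constraint listed above. The postal code patterns are simplified (five digits for DE and FR, the letter-digit alternation for CA), and the `accountType` field, the violation codes, and the function names are hypothetical.

```typescript
interface FormRecord {
  country: string;
  postalCode: string;
  accountType: string; // hypothetical field that selects the company branch
  companyName: string; // required only when accountType is "company"
  fullName: string;
}

type Violation = "unknown_country" | "postal_code_format" | "company_name_required" | "name_characters";

// Simplified postal code patterns; real rules have more exceptions.
const postalPatterns: { [country: string]: RegExp } = {
  DE: /^\d{5}$/,
  FR: /^\d{5}$/,
  CA: /^[A-Z]\d[A-Z] ?\d[A-Z]\d$/,
};

// Letters from any script, combining marks, spaces, apostrophes, and hyphens.
const allowedNameChars = new RegExp("^[\\p{L}\\p{M}' -]+$", "u");

function findViolations(rec: FormRecord): Violation[] {
  const found: Violation[] = [];
  const pattern = postalPatterns[rec.country];
  if (pattern === undefined) found.push("unknown_country");
  else if (!pattern.test(rec.postalCode)) found.push("postal_code_format");
  if (rec.accountType === "company" && rec.companyName.trim() === "") {
    found.push("company_name_required");
  }
  if (!allowedNameChars.test(rec.fullName)) found.push("name_characters");
  return found;
}

// A valid scenario must have no violations. An invalid scenario must fail for
// exactly the intended reason, not just for any reason.
function acceptForScenario(rec: FormRecord, intended: Violation | null): boolean {
  const found = findViolations(rec);
  if (intended === null) return found.length === 0;
  return found.length === 1 && found[0] === intended;
}

const shortPostalRecord: FormRecord = {
  country: "FR",
  postalCode: "7500",
  accountType: "individual",
  companyName: "",
  fullName: "Zoë O'Brien-Müller",
};
console.log(acceptForScenario(shortPostalRecord, "postal_code_format")); // true
```

The second function is the important one for negative tests: a record meant to fail postal code validation is rejected if it would also trip the company name rule, because the test could then pass for the wrong reason.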

4) Keep canonical fixtures for critical paths

AI-generated data is helpful, but we did not replace every stable fixture. For high-risk flows, such as payment, identity, or compliance-related forms, we kept a small set of canonical fixtures that were deliberately boring.

That gave us a reliable baseline. AI-generated records then expanded branch coverage around that baseline.

5) Make the UI steps editable

This matters more than it sounds.

Opaque AI workflows often couple data generation, element discovery, and action sequencing into a single black box. When that breaks, the whole test becomes hard to salvage.

Tools with editable test creation steps help reduce that maintenance burden because the test remains inspectable and adjustable. In platforms like Endtest, an agentic AI [test automation](https://en.wikipedia.org/wiki/Test_automation) platform, the AI Test Creation Agent creates standard, editable, platform-native steps inside the product, which is much easier to maintain than an uneditable flow that hides the intermediate logic. That is not a reason to use one tool everywhere, but it is a strong reason to prefer workflows where the generated artifact stays understandable.

A practical pattern for dynamic form testing

For teams building synthetic test data pipelines, this pattern worked well.

Step 1, describe the branch you want

Use a scenario definition, not a vague prompt.

Examples:

valid user in Germany, tax ID optional, newsletter enabled
invalid user in France, postal code too short
Spanish user, conditional company name required, leave it empty
Canadian user, province selected, phone format intentionally malformed

Step 2, generate the record

Have the AI output a structured record with only the fields the test needs.

Step 3, validate the record

Reject records that do not meet your preconditions.

Step 4, execute a normal test

Use stable locators, explicit waits, and a reusable data setup layer.

Step 5, assert the branch outcome

Check that the right section appears, the right validation text is shown, and the form state matches the chosen scenario.

Here is a compact Playwright example showing the execution side, with data injected from a fixture-like object rather than invented inline:

import { test, expect } from '@playwright/test';

// Record generated and validated before the test runs; taxId is empty on purpose.
const data = { locale: 'fr-FR', country: 'FR', postalCode: '75008', taxId: '', expectedBranch: 'tax_id_required' };

test('conditional tax id field appears for FR', async ({ page }) => {
  // Relative URL: assumes baseURL is set in the Playwright config.
  await page.goto('/signup');
  await page.selectOption('[name="country"]', data.country);
  await page.fill('[name="postalCode"]', data.postalCode);
  await page.click('button[type="submit"]');

  // Submitting with an empty tax ID should surface the required-field message.
  await expect(page.getByText('Tax ID is required')).toBeVisible();
});

This looks simple, but the key is that the record is not auto-magically woven into the test logic. It is fed into a predictable test.
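The assertion in that test is hardcoded to one branch. One way to keep step 5 data-driven is a lookup from the record's expectedBranch to the outcome the test should assert. The sketch below is an assumption about how such a table might look: only the tax_id_required message comes from the example above, while the section identifiers, the happy_path entry, and `expectationFor` are hypothetical.

```typescript
// Hypothetical description of what a test should assert for a given branch.
interface BranchExpectation {
  visibleSection: string | null; // identifier of the section that should appear, if any
  validationText: string | null; // validation message that should be shown, if any
  submitSucceeds: boolean;       // whether the form should submit successfully
}

const branchExpectations: { [branch: string]: BranchExpectation } = {
  tax_id_required: {
    visibleSection: "tax-id-section", // assumed identifier
    validationText: "Tax ID is required",
    submitSucceeds: false,
  },
  happy_path: {
    visibleSection: null,
    validationText: null,
    submitSucceeds: true,
  },
};

// Fail fast when a record names a branch that nobody has defined expectations for.
function expectationFor(branch: string): BranchExpectation {
  const found = branchExpectations[branch];
  if (found === undefined) {
    throw new Error(`no expectation defined for branch "${branch}"`);
  }
  return found;
}

console.log(expectationFor("tax_id_required").validationText); // "Tax ID is required"
```

A test can then read `expectationFor(data.expectedBranch)` and assert on its fields, so adding a new scenario means adding a table entry rather than a new hand-written assertion.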

Localization edge cases deserve their own data strategy

Localization is where AI-generated test data can be especially useful, and especially dangerous.

The useful part is obvious. AI can generate names, addresses, and messages in multiple languages faster than a human can maintain dozens of locale-specific fixture files.

The dangerous part is that localization is rarely just translation. It is formatting, validation, and business logic all at once.

Things we had to account for:

date formats that change by locale
decimal separators in currency fields
right-to-left UI behavior
Unicode normalization in names and search fields
language-specific validation messages
country-specific required fields

For example, a form may accept a postal code pattern but still reject the record because the translated label no longer matches a selector strategy or assertion. That is one reason it helps to use robust checks, not brittle string equality for everything.

If your validation is about meaning rather than exact copy, tools such as AI Assertions can be useful in the right places, because they let you describe what should be true without binding every test to a fixed string. For a dynamic form with translated success states, that can reduce churn. Use that judiciously, though, because plain assertions are still better when the exact value is the point.
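Two of the items in the list above, locale-dependent date formats and Unicode normalization, are easy to demonstrate in isolation. This is a minimal sketch; `parseNumericDate` and `sameAfterNormalization` are hypothetical helpers, not part of any library.

```typescript
type DateOrder = "DMY" | "MDY";

// Interpret a numeric date such as "01/02/2024" under an explicit field order
// and return it as an unambiguous ISO-style string (YYYY-MM-DD).
function parseNumericDate(text: string, order: DateOrder): string {
  const parts = text.split(/[\/.\-]/).map((piece) => Number(piece));
  if (parts.length !== 3 || parts.some((n) => !Number.isInteger(n))) {
    throw new Error(`cannot parse date "${text}"`);
  }
  const [first, second, year] = parts as [number, number, number];
  const day = order === "DMY" ? first : second;
  const month = order === "DMY" ? second : first;
  const pad = (n: number): string => (n < 10 ? "0" : "") + n;
  return `${year}-${pad(month)}-${pad(day)}`;
}

console.log(parseNumericDate("01/02/2024", "DMY")); // "2024-02-01"
console.log(parseNumericDate("01/02/2024", "MDY")); // "2024-01-02"

// The same visible character can be stored precomposed or as base letter plus
// combining mark. The two forms compare unequal until they are normalized.
const composedE = "\u00E9";    // "é" as a single code point
const decomposedE = "e\u0301"; // "e" followed by a combining acute accent

function sameAfterNormalization(a: string, b: string): boolean {
  return a.normalize("NFC") === b.normalize("NFC");
}

console.log(composedE === decomposedE);                      // false
console.log(sameAfterNormalization(composedE, decomposedE)); // true
```

Storing generated dates in the ISO form and converting at the point of entry keeps the artifact unambiguous, whatever order the form under test expects.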

Maintenance tradeoffs, the part teams usually underestimate

The first question teams ask is whether AI-generated test data saves time. The better question is where the time moves.

It reduces manual fixture creation

Yes, especially for broad coverage across locales and branches.

It adds validation and review overhead

Also yes. Someone has to define schemas, scenario templates, and domain constraints.

It can make failures harder to explain

Unless the generated record is logged clearly and reused across reruns.

It can increase coverage without increasing human writing time

That is the strongest argument for it, especially when the form space is combinatorial.

The maintenance story improves significantly when reusable data setup is separated from the UI steps. This is where editable workflows matter. If a generated test is just a black box, future maintainers will be afraid to touch it. If the same test is broken into explicit steps, reviewers can change a locator, swap a field value, or update an assertion without regenerating the whole flow.

That also makes it easier to combine AI-generated data with self-healing or locator recovery features in a responsible way. For example, self-healing can help when a class name changes, while the data layer remains deterministic. Endtest’s self-healing tests are in this category, where locator drift is reduced without turning the entire test into a mystery box. The important part is that healing should fix locator drift, not hide bad data or broken branching logic.

A few patterns we stopped using

Fully random end-to-end runs

Interesting for exploration, poor for CI signal.

One giant prompt for the whole form flow

Too many assumptions, too much hidden state.

Reusing the same synthetic person everywhere

Fast, but it hides bugs in locale, validation, and personalization logic.

Asserting only success paths

Dynamic forms fail in important ways, and those failures should be first-class tests.

When AI test data generation is worth it

Use it when you need one or more of these:

coverage across many locale and branch combinations
edge cases that are too numerous to hand-author
repeatable invalid data for validation testing
generation of privacy-safe synthetic data instead of production copies
faster expansion of a form matrix without multiplying fixture files

Be cautious when the flow is highly regulated, the form logic is simple, or the test needs perfect determinism from the first run.

A simple decision rule

If the form branch depends mostly on field values and locale, AI-generated data can help.

If the test failure must always be explainable by a human in one minute, keep the generated data constrained and reviewable.

If the workflow hides the generated data inside an uneditable chain of agent actions, expect maintenance pain later.

If your platform lets you keep the steps editable and the data reusable, the odds improve significantly.

Closing notes from the lab

The main lesson from this experiment was not that AI can replace test data work. It cannot. The useful lesson is that AI can accelerate the construction of diverse, realistic, branch-specific data, but only if the surrounding system is disciplined.

For dynamic form testing, that discipline means:

structured outputs
upfront validation
reusable datasets
explicit branch goals
editable test steps
stable assertions where possible
meaning-based checks where exact text is not the point

If you are evaluating tools, look for the boring things first. Can you inspect the generated data? Can you edit the steps? Can you rerun the same scenario exactly? Can you tell whether a failure was data-related or UI-related?

Those questions matter more than whether the generator sounded intelligent.

For teams that want AI-assisted testing without giving up control, that is the real bar. The best setup is not the most automated one; it is the one your team can still understand six months later.