Large language models (LLMs) can generate customer journeys that appear smooth and engaging, but evaluating whether these journeys are structurally sound remains challenging for current methods.
This article introduces Continuity, Deepening, and Progression (CDP) — three deterministic, content-structure-based metrics for evaluating multi-step journeys using a predefined taxonomy rather than stylistic judgment.
Traditionally, optimizing customer-engagement systems has involved fine-tuning delivery mechanics such as timing, channel, and frequency to achieve engagement and business results.
In practice, this meant you trained the model to understand rules and preferences, such as “Don’t contact customers too often”, “Client Alfa responds better to phone calls”, and “Client Beta opens emails mostly in the evening.”
To manage this, you built a cool-off matrix that balances timing, channel constraints, and business rules governing customer communication.
So far, so good. The mechanics of delivery are optimized.
At this point, the core challenge arises when the LLM generates the journey itself. The issue is not just about channel or timing, but whether the sequence of messages forms a coherent, effective narrative that meets business objectives.
And suddenly you realize:
There is no standard metric to determine if an AI-generated journey is coherent, meaningful, or advances business goals.
Contents
- What We Expect From a Successful Customer Journey
- Why Existing LLM Evaluation Metrics Fall Short
- A Structural Approach to Evaluating Customer Journeys
- Applying CDP Metrics to an Automotive Customer Journey
- Input Data: Anchors and Journey Content
- Taxonomy Construction Results
- Mapping Customer Journeys onto the Taxonomy
- A Step-by-Step Walkthrough: From Journey Text to CDP Scores
- Conclusion: From Scores to Successful Journeys
What We Expect From a Successful Customer Journey
From a business perspective, the sequence of contents per journey step cannot be random: it must be a guided experience that feels coherent, moves the customer forward through meaningful stages, and deepens the relationship over time.
While this intuition is common, it is also supported by customer-engagement research. Brodie et al. (2011) describe engagement as “a dynamic, iterative process” that varies in intensity and complexity as value is co-created over time.
In practice, we evaluate journey quality along three complementary dimensions:
Continuity — whether each message fits the context established by prior interactions.
Deepening — whether content becomes more specific, relevant, or personalized rather than remaining generic.
Progression — whether the journey advances through stages (e.g., from exploration to action) without unnecessary backtracking.
Why Existing LLM Evaluation Metrics Fall Short
If we look at standard evaluation methods for LLMs, such as accuracy metrics, similarity metrics, human-evaluation criteria, or even LLM-as-a-judge, it becomes clear that none provide a reliable, unambiguous way to evaluate customer journeys generated as multi-step sequences.
Let’s examine what each of these standard metrics can and can’t provide.
Accuracy Metrics (Perplexity, Cross-Entropy Loss)
These metrics measure a model’s confidence in predicting the next token given its training data. They do not capture whether a generated sequence forms a coherent or meaningful journey.
Similarity Metrics (BLEU, ROUGE, METEOR, BERTScore, MoverScore)
These metrics compare the generated result to a reference text. However, customer journeys rarely have a single correct reference, as they adapt to context, personalization, and prior interactions. Structurally valid journeys may differ significantly while remaining effective.
Semantic similarity still has its uses, however: we rely on cosine similarity later when mapping messages onto the taxonomy.
Human Evaluation (Fluency, Relevance, Coherence)
Human judgment often outperforms automated metrics in assessing language quality, but it is poorly suited for continuous journey evaluation. It is expensive, suffers from cultural bias and ambiguity, and does not function as a permanent part of the workflow but rather as a one-time effort to bootstrap a fine-tuning stage.
LLM-as-a-Judge (AI feedback scoring)
Using LLMs to evaluate the outputs of other LLM systems is an appealing and increasingly common practice.
However, this approach tends to focus on style, clarity, and tone rather than structural evaluation.
LLM-as-a-Judge can be applied in multi-stage use cases, but results are often less precise due to the increased risk of context overload. Fine-grained evaluation scores from this method are also often unreliable, and, like human evaluators, an LLM judge carries its own biases and ambiguities.
A Structural Approach to Evaluating Customer Journeys
Ultimately, the primary missing element in evaluating recommended content sequences within the customer journey is structure.
The most natural way to represent content structure is as a taxonomic tree, a hierarchical model consisting of stages, content themes, and levels of detail.
Once customer journeys are mapped onto this tree, CDP metrics can be defined as structural variations:
- Continuity: smooth movement across branches
- Deepening: moving into more specific nodes
- Progression: moving forward through customer journey stages
The solution is to represent a journey as a path through a hierarchical taxonomy derived from the content space. Once this representation is established, CDP metrics can be computed deterministically from the path. The diagram below summarizes the entire pipeline.
Constructing the Taxonomy Tree
To evaluate customer journeys structurally, we first require a structured representation of content. We construct this representation as a multi-level taxonomy derived directly from customer-journey text using semantic embeddings.
The taxonomy is anchored by a small set of high-level stages (e.g., motivation, purchase, delivery, ownership, and loyalty). Both anchors and journey messages are embedded into the same semantic vector space, allowing content to be organized by semantic proximity.
Within each anchor, messages are grouped into progressively more specific themes, forming deeper levels of the taxonomy. Each level refines the previous one, capturing increasing topical specificity without relying on manual labeling.
The result is a hierarchical structure that groups semantically related journey messages and provides a stable foundation for evaluating how journeys flow, deepen, and progress over time.
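As a sketch of the first construction step, the snippet below assigns a message to its closest anchor by cosine similarity. A real pipeline would use sentence embeddings (e.g., a sentence-transformer model); a toy bag-of-words vectorizer stands in here so the example is self-contained, and `embed` and `nearest_anchor` are illustrative names, not the article's implementation:

```python
import numpy as np

def embed(text, vocab):
    """Toy stand-in for a semantic embedding: word counts over a vocabulary."""
    words = text.lower().split()
    return np.array([words.count(w) for w in vocab], dtype=float)

def nearest_anchor(message, anchor_texts, vocab):
    """Return the index of the anchor whose vector is closest by cosine."""
    v = embed(message, vocab)
    best, best_sim = 0, -1.0
    for i, anchor in enumerate(anchor_texts):
        u = embed(anchor, vocab)
        denom = (np.linalg.norm(v) * np.linalg.norm(u)) or 1.0
        sim = float(v @ u) / denom
        if sim > best_sim:
            best, best_sim = i, sim
    return best
```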
Mapping Customer Journeys onto the Taxonomy
Once the taxonomy is established, individual customer journeys are mapped onto it as ordered sequences of messages. Each step is embedded in the same semantic space and matched to the closest taxonomy node using cosine similarity.
This mapping converts a temporal sequence of messages into a path through the taxonomy, enabling the structural analysis of journey evolution rather than treating the journey as a flat list of texts.
Defining the CDP Metrics
The CDP framework consists of three complementary metrics: Continuity, Deepening, and Progression. Each captures a distinct aspect of journey quality. We describe these metrics conceptually before defining them formally based on the taxonomy-mapped journey.

Setup and Computation
Before analyzing real journeys, we clarify two aspects of the setup:
(1) how journey content is structurally represented, and
(2) how CDP metrics are derived from that structure.
Customer-journey content is organized into a hierarchical taxonomy consisting of anchors (L1 journey stages), thematic heads (L2 topics), and deeper nodes that represent increasing specificity:
Anchor (L1)
└── Head (L2)
    └── Child (L3)
        └── Grandchild (L4+)
Once a journey is mapped onto this hierarchy, Continuity, Deepening, and Progression are computed deterministically from the journey’s path through the tree.
Let a customer journey be an ordered sequence of steps:
J = (x₁, x₂, …, xₙ)
Each step xᵢ is assigned:
- aᵢ — anchor (L1 journey stage)
- tᵢ — thematic head (L2 topic), where tᵢ = 0 means “unknown”
- ℓᵢ — taxonomy depth level (L1 = 0, L2 = 1, L3 = 2, …)
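In code, a mapped step can be represented as a small record holding these three labels; the class and field names below are illustrative, not taken from the article's implementation:

```python
from dataclasses import dataclass

@dataclass
class Step:
    """One journey step after taxonomy mapping (notation aᵢ, tᵢ, ℓᵢ)."""
    anchor: int   # a_i - L1 journey stage index
    head: int     # t_i - L2 topic id, where 0 means "unknown"
    level: int    # l_i - taxonomy depth (L1 = 0, L2 = 1, L3 = 2, ...)

# A three-step journey: two steps in stage 0 (the second one deeper),
# then a move into stage 1.
journey = [Step(0, 3, 1), Step(0, 3, 2), Step(1, 7, 1)]
```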
Continuity (C)
Continuity evaluates whether consecutive messages remain contextually and thematically coherent.
For each transition (xᵢ → xᵢ₊₁), a step-level continuity score cᵢ ∈ [0, 1] is assigned based on taxonomy alignment, with higher weights given to transitions that stay within the same topic or closely related branches.
Transitions are ranked from strongest to weakest (e.g., same topic, related topic, forward stage move, backward move), and
assigned decreasing weights:
1 ≥ α₁ > α₂ > α₃ > α₄ > α₅ > α₆ ≥ 0
The overall continuity score is computed as:
C(J) = (1 / (n − 1)) · Σ cᵢ for i = 1 … n−1
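A minimal sketch of the continuity computation, assuming each step has already been mapped to an (anchor, head) pair. Only four of the six ranked transition types are shown, and the α weights are illustrative values, not the article's calibrated ones:

```python
# Illustrative transition weights (alpha_1 > alpha_2 > alpha_3 > alpha_4).
ALPHAS = {
    "same_topic": 1.0,      # same L2 head, head known
    "related_topic": 0.8,   # different head, same anchor
    "forward_stage": 0.6,   # move to a later anchor
    "backward_stage": 0.2,  # move to an earlier anchor
}

def step_continuity(a_i, t_i, a_j, t_j):
    """Score one transition (x_i -> x_{i+1}) from its anchors and heads."""
    if a_i == a_j and t_i == t_j and t_i != 0:
        return ALPHAS["same_topic"]
    if a_i == a_j:
        return ALPHAS["related_topic"]
    if a_j > a_i:
        return ALPHAS["forward_stage"]
    return ALPHAS["backward_stage"]

def continuity(steps):
    """C(J): mean step-level continuity over the n - 1 transitions.
    steps: ordered list of (anchor, head) pairs."""
    scores = [
        step_continuity(a_i, t_i, a_j, t_j)
        for (a_i, t_i), (a_j, t_j) in zip(steps, steps[1:])
    ]
    return sum(scores) / len(scores) if scores else 0.0
```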
Deepening (D)
Deepening measures whether a journey accumulates value by moving from general content toward more specific or detailed
interactions. It is computed using two complementary components.
Journey-based deepening captures how depth changes along the observed path:
Δᵢᵈᵉᵖᵗʰ = ℓᵢ₊₁ − ℓᵢ, dᵢ = max(Δᵢᵈᵉᵖᵗʰ, 0)
D_journey(J) = (1 / (n − 1)) · Σ dᵢ
Taxonomy-aware deepening measures how deeply a journey explores the actual taxonomy tree, based on the heads it visits.
It evaluates how many of the possible deeper content items (children, sub-children, etc.) under each visited head are later seen
during the journey.
D_taxonomy(J) = |D_seen(J)| / |D_exist(J)|
The final deepening score is a weighted combination:
D(J) = λ₁ · D_taxonomy(J) + λ₂ · D_journey(J), λ₁ + λ₂ = 1.
Deepening lies in [0, 1].
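The two deepening components and their combination can be sketched as follows; the default λ₁ = λ₂ = 0.5 split and the function names are assumptions for illustration:

```python
def d_journey(levels):
    """Journey-based deepening: mean positive depth change per transition."""
    gains = [max(b - a, 0) for a, b in zip(levels, levels[1:])]
    return sum(gains) / len(gains) if gains else 0.0

def d_taxonomy(seen, exist):
    """Taxonomy-aware deepening: share of the deeper nodes available
    under the visited heads that the journey actually reaches."""
    return len(seen) / len(exist) if exist else 0.0

def deepening(levels, seen, exist, lam1=0.5, lam2=0.5):
    """D(J) = lam1 * D_taxonomy + lam2 * D_journey, with lam1 + lam2 = 1."""
    return lam1 * d_taxonomy(seen, exist) + lam2 * d_journey(levels)
```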
Progression (P)
Progression measures directional movement through journey stages. For each transition, we compute:
Δᵢ = aᵢ₊₁ − aᵢ.
Only moving steps (Δᵢ ≠ 0) are considered. Let wᵢ denote the relative importance of the current stage, and let pᵢ be the score of the i-th transition (distinct from the continuity score cᵢ).
If Δᵢ > 0 (forward movement):
pᵢ = wᵢ / Δᵢ
If Δᵢ < 0 (backward movement):
pᵢ = Δᵢ · wᵢ
The raw progression score is:
P_raw(J) = Σ pᵢ for all i where Δᵢ ≠ 0
To bound the score to [−1, +1], we apply a tanh normalization:
P(J) = (e^(P_raw) − e^(−P_raw)) / (e^(P_raw) + e^(−P_raw))
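A runnable sketch of the progression score, using `math.tanh` for the normalization; the per-stage weight vector passed in is a hypothetical importance profile:

```python
import math

def progression(anchors, weights):
    """P(J): tanh-normalised sum of stage transitions.
    anchors: ordered L1 stage indices of the journey steps.
    weights: weights[i] is the relative importance of stage i."""
    raw = 0.0
    for a_i, a_j in zip(anchors, anchors[1:]):
        delta = a_j - a_i
        if delta == 0:
            continue                       # only moving steps count
        w = weights[a_i]                   # importance of the current stage
        raw += w / delta if delta > 0 else delta * w
    return math.tanh(raw)
```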
Applying CDP Metrics to an Automotive Customer Journey
To demonstrate how structured evaluation works on realistic journeys, we generated a synthetic automotive customer-journey dataset covering the main stages of the customer lifecycle.

Input Data: Anchors and Journey Content
The CDP framework uses two main inputs: anchors, which define journey stages, and customer-journey content, which provides the messages to evaluate.
Anchors represent meaningful phases in the lifecycle, such as motivation, purchase, delivery, ownership, and loyalty. Each anchor is augmented with a small set of representative keywords to ground it semantically. Anchors serve both as reference points for taxonomy construction and as the expected directional flow used later in the Progression metric.
Anchor words:
- motivation: exploration, research, discovery, interest, test drive, needs assessment, experience
- purchase: financing, comparison, quotes, loan, negotiation, credit pre-approval, deposit
- delivery: paperwork, signing, deposit, logistics, handover, activation
- ownership: maintenance, warranty, repair, dealer support, service, inspections
- loyalty: feedback, satisfaction survey, referral, upgrade, retention, advocacy
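Expressed as a Python mapping (the grouping of multi-word keyword phrases is our reading of the list above, and the variable names are illustrative), the anchors might look like:

```python
# Anchors in lifecycle order; each maps to its grounding keywords.
ANCHORS = {
    "motivation": ["exploration", "research", "discovery", "interest",
                   "test drive", "needs assessment", "experience"],
    "purchase":   ["financing", "comparison", "quotes", "loan",
                   "negotiation", "credit pre-approval", "deposit"],
    "delivery":   ["paperwork", "signing", "deposit", "logistics",
                   "handover", "activation"],
    "ownership":  ["maintenance", "warranty", "repair", "dealer support",
                   "service", "inspections"],
    "loyalty":    ["feedback", "satisfaction survey", "referral",
                   "upgrade", "retention", "advocacy"],
}

# Insertion order doubles as the expected directional flow for Progression.
STAGE_ORDER = list(ANCHORS)
```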
Customer-journey content consists of short, action-oriented CRM-style messages (emails, calls, chats, in-person interactions) with varying levels of specificity and spanning multiple stages. Although this dataset is synthetically generated, the stage labels used during generation are not fed into taxonomy construction or CDP scoring; the structure is rebuilt from the message text itself.
CJ messages:
Explore models that match your lifestyle and personal goals.
Take a virtual tour to discover key features and trims.
Compare body styles to assess space, comfort, and utility.
Book a test drive to experience handling and visibility.
Use the needs assessment to rank must-have features.
Filter models by range, mpg, or towing to narrow choices.
Taxonomy Construction Results
Here, we applied the taxonomy construction process to the automotive customer-engagement dataset. The figure below shows the resulting customer-journey taxonomy, built from message content and anchor semantics.
Each top-level branch corresponds to a journey anchor (L1), which represents major journey stages such as Motivation, Purchase, Delivery, Ownership, and Loyalty.
Deeper levels (L2, L3+) group messages by thematic similarity and increasing specificity.

What the Taxonomy Reveals
Even in this compact dataset, the taxonomy highlights several functional patterns:
- Early-stage messages cluster around exploration and comparison, gradually narrowing toward concrete actions such as booking a test drive.
- Purchase-related content separates naturally into financial planning, document handling, and finalization.
- Ownership content shows a clear progression from maintenance scheduling to diagnostics, cost estimation, and warranty evaluation.
- Loyalty content shifts from transactional actions toward feedback, upgrades, and advocacy.
While these patterns align with how practitioners typically reason about journeys, they arise directly from the data rather than from predefined rules.
Why This Matters for Evaluation
This taxonomy now provides a shared structural reference:
- Any customer journey can be mapped as a path through the tree.
- Movement across branches, depth levels, and anchors becomes measurable.
- Continuity, Deepening, and Progression are no longer abstract concepts; they now correspond to concrete structural changes.
In the next section, we use this taxonomy to map real journey examples and compute CDP scores step by step.
Mapping Customer Journeys onto the Taxonomy
Once the taxonomy is constructed, evaluating a customer journey becomes a structural problem.
Each journey is represented as an ordered sequence of customer-facing messages.
Instead of judging these messages in isolation, we project them onto the taxonomy and analyze the resulting path.
Formally, a journey J = (x₁, x₂, …, xₙ) is mapped to a sequence of taxonomy nodes: (x₁→v₁),(x₂→v₂),…,(xₙ→vₙ) where each vᵢ is the closest taxonomy node based on embedding similarity.
A Step-by-Step Walkthrough: From Journey Text to CDP Scores
To make the CDP framework concrete, let’s walk through a single customer journey example and show how it is evaluated step by step.
Step 1 — The Customer Journey Input
We begin with an ordered sequence of customer-facing messages generated by an LLM.
Each message represents a touchpoint in a realistic automotive customer journey:
journey = ['Take a virtual tour to discover key features and trims.',
           'We found a time slot for a test drive that fits your schedule.',
           'Upload your income verification and ID to finalize the pre-approval decision.',
           'Estimate costs for upcoming maintenance items.',
           'Track retention offers as your lease end nears.',
           'Add plates and registration info before handover.']
Step 2 — Mapping the Journey into the Taxonomy
For structural evaluation, each journey step is mapped into the customer-journey taxonomy. Using text embeddings, each message is matched to its closest taxonomy node. This produces a journey map (jmap), a structured representation of how the journey traverses the taxonomy.
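A hedged sketch of this matching step: `build_jmap` and the flat node index it consumes are hypothetical names, and the taxonomy node vectors are assumed to live in the same embedding space as the step vectors:

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    return float(u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))

def build_jmap(step_vectors, taxonomy_nodes):
    """Map each journey step to its closest taxonomy node.
    taxonomy_nodes: list of (vector, anchor, head, level) tuples,
    one per node in the flattened tree."""
    jmap = []
    for v in step_vectors:
        # Pick the node whose embedding is most similar to this step.
        _, anchor, head, level = max(taxonomy_nodes,
                                     key=lambda n: cosine(v, n[0]))
        jmap.append({"anchor": anchor, "head": head, "level": level})
    return jmap
```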

Table 2: Each message is assigned to an anchor (stage), a thematic head, and a depth level in the taxonomy based on semantic similarity in the shared embedding space. This table acts as the foundation for all future evaluations.
Step 3 — Applying CDP Metrics to This Journey
Once the journey is mapped, we compute Continuity, Deepening, and Progression deterministically from step-to-step transitions.

Table 3: Each row represents a transition between consecutive journey steps, annotated with signals for continuity, deepening, and progression.
Final CDP scores (this journey):

Taken together, the CDP signals indicate a journey that is largely coherent and forward-moving, with one clear moment of
deepening and one visible structural regression. Importantly, these insights are derived solely from structure, not from
stylistic judgments about the text.
Conclusion: From Scores to Successful Journeys
Continuity, Deepening, and Progression are determined by structure and can be applied wherever LLMs generate multi-step
content:
- to compare alternative journeys generated by different prompts or models,
- to provide automated feedback for improving journey generation over time.
In this way, CDP scores offer structural feedback for LLMs. They complement, rather than replace, stylistic or fluency-based evaluation by providing signals that reflect business logic and customer experience.
Although this article focuses on automotive commerce, the concept is broadly applicable: any system that generates ordered, goal-oriented content can benefit from this kind of structural evaluation.
Large language models are already capable of generating fluent, persuasive text.
The greater challenge is ensuring that text sequences form coherent narratives that align with business logic and user experience.
CDP provides a way to make structure explicit, measurable, and actionable.
Thank you for staying with me through this journey. Hopefully, this concept helps you think differently about evaluating AI-generated sequences and inspires you to treat structure as a primary signal in your own systems. All logic presented in this article is implemented in the accompanying Python code on GitHub. If you have any questions or comments, please leave them in the comments section or reach out via LinkedIn.
References
Brodie, R. J., Hollebeek, L. D., Jurić, B., & Ilić, A. (2011). Customer engagement: Conceptual domain, fundamental propositions, and implications for research. Journal of Service Research, 14(3), 252–271.
