Sample Chapter: The Dark Art of Conversation Design

Ok, so I thought it would be useful to include a sample chapter from my upcoming book, "How to Write Voice AI Agents Your Customers Won't Hate". This chapter is an excerpt to give you an idea of the kind of insights (and the style of writing) in the book:

Chapter 3: The dark art of conversation design

Over my career I have worked with or managed many programmers, some whose problem-solving capabilities defied belief. However, one of the things I have learnt is that conversation design is hard, and it's a skillset completely different from programming.

Voice metrics vs text metrics

Humans on average speak at around 150 words per minute, and less (c. 130) when providing clarity, which is often needed in a customer support scenario. Compare that to a reading rate of 250 words per minute and voice is roughly 50% slower than a text channel like webchat – remember that.

And LLMs are even worse: they can produce 15,000 words in milliseconds, so the very first thing in your LLM prompt should be to remind it that it is in a call centre and so answers must be succinct (ideally a single short sentence).
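To make the brevity rule concrete, here is a minimal sketch – the prompt wording and the one-sentence trim guard below are my own illustrations, not from the book – of reminding the model it is on a voice call, then guarding against long answers anyway:

```python
import re

# Illustrative system prompt: remind the model the caller HEARS the answer.
SYSTEM_PROMPT = (
    "You are a voice agent in a call centre. The caller hears your words "
    "read aloud, so answers must be succinct: ideally one short sentence."
)

def trim_to_first_sentence(reply: str, max_words: int = 30) -> str:
    """Belt-and-braces guard: keep only the first sentence, capped at max_words."""
    first = re.split(r"(?<=[.!?])\s", reply.strip())[0]
    return " ".join(first.split()[:max_words])
```

Models will still occasionally ramble, so a post-hoc trim like this is a cheap safety net on top of the prompt instruction.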

Why is it so difficult?

Conversation is inherently emotional, whereas traditional software development is logical. And taking a logical approach to an emotional issue is probably the most common reason for divorce…

The skills and behaviours for conversation design are far more suited to a call centre agent, or a business analyst than a programmer or architect, so review the skills on your team and, if necessary, get someone in who gets customers and process to help.

Clear use cases

Before starting you should have very clear use cases for what each conversation is designed to achieve. If you're lucky enough to have good product and service management, the definition of what a customer can do with a product (replace it, fix it, cancel it, upgrade it, etc.) should be clearly defined already. If it isn't, then I would recommend defining that before you try and design your conversation flow, because for a successful customer experience it's all about capturing their intent.

Intent and Goal Orientated

So, the customer has reached out to you – it's for a reason, they want you to help them in some way. This is the customer intent and, once you have the intent, each step in the conversation should have the sole goal of fulfilling that intent as quickly as possible.

Each intent will be its own workflow; however, some products will have shared components (e.g. Verifying the customer) or parameters (e.g. a telephone number) across all intents.

Where you have a shared component across all your flows, it’s a good candidate to place in the concierge agent that does the initial greeting and routing to the other agents.
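As a sketch of that concierge pattern – all the agent names and the routing table below are hypothetical, not from the book:

```python
# The concierge owns the shared components (e.g. customer verification) and
# routes each intent to its own workflow agent.
INTENT_AGENTS = {
    "cancel": "cancellation_agent",
    "upgrade": "upgrade_agent",
    "repair": "repair_agent",
}

def verify_customer(phone: str, known_numbers: set) -> bool:
    # Shared component: lives in the concierge so every flow gets it for free.
    return phone in known_numbers

def route(intent: str) -> str:
    # Unknown intents fall back to a human queue (the escape hatch).
    return INTENT_AGENTS.get(intent, "human_queue")
```

The design choice is that each downstream agent then has exactly one responsibility, which the concierge section later in this chapter expands on.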

Did we mention an escape hatch?

Not to belabour the point, but the customer must be given an escape hatch: a way to break out from the AI and reach a human (or at least a call queue).

Transparency & Trust

Be transparent with the customer about what the AI can and can't do; transparency builds trust. Likewise, if a part of the process takes a while (more than 10-15 seconds), let the customer know, and offer them a call back if it's really long.

People are inherently sceptical of AI, the more trust you can build, the better your customer satisfaction will be. We cover this in the next section on Human Centred Design.

(Note this goes against Google's best-practice guidance for voice agents – but I'm going with the scientific research on empathy, combined with an iron-clad personal belief that transparency with customers builds trust.)

Error Handling & the three fingered salute

Pre-GPT, in the dark and dingy days of Lex vs Dialogflow, the default configuration of an Alexa bot was to say "Sorry, I don't understand" three times and then hang up. We coined this "the three fingered salute", and it was one of the worst customer experiences ever.

You will have errors, mistakes, and unrecognizable customer utterances due to pronunciation or background noise, and these should be captured in a fallback loop. They're inevitable, but how you handle them is another good way to differentiate.
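A minimal sketch of such a fallback loop, assuming varied re-prompts and escalation to a human after two misses (the phrases and threshold are illustrative):

```python
import random

# Varied re-prompts avoid the "three fingered salute"; after max_misses we
# escalate to a human instead of hanging up.
FALLBACKS = [
    "Sorry, I didn't catch that - could you say it another way?",
    "I'm still not following. Could you rephrase?",
]

class FallbackLoop:
    def __init__(self, max_misses: int = 2):
        self.misses = 0
        self.max_misses = max_misses

    def on_unrecognised(self) -> str:
        self.misses += 1
        if self.misses > self.max_misses:
            # The escape hatch from the previous section.
            return "Let me put you through to one of our team."
        return random.choice(FALLBACKS)
```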

Choose the right voice

There has been significant research on whether people prefer a male or female voice. In general, female voices are trusted more, with the exception of medical and sport contexts. However, if you choose a specialized voice, or even clone a human voice, you can potentially get a better user experience than taking the out-of-the-box voices.

I encourage you to look at and evaluate the specialist voice vendors, and gather real customer feedback on which voice they feel more comfortable engaging with.

Best practice for natural language

Ultimately the litmus test for every agent conversation is "does it feel like I am talking to a robot?". This is the topic of natural language and how to make your agent feel as human (or really, as non-robotic) as possible.

This is the area where I see the most common issues with voice agents: they can resolve issues, but it's not a natural, human engagement – it feels more like using a vending machine.

Below are the best practice pointers to help you avoid the most common issues.

A quick discussion of Speech Synthesis Markup Language (SSML)

There is a formal definition, based on an XML format, for instructing computers how to modify the way they pronounce audio. You can control pitch, tempo – the speed of speaking (they call this prosody rate) – the works.

It involves looking at every single turn in the flow and changing the text to something like the following:

<speak>
  Hello, I can <emphasis level="strong">speak</emphasis> with different <prosody rate="slow">rates</prosody> and <break time="500ms"/> pauses.
</speak>

This is a real Pandora's box: it gives you complete control over how audio is pronounced, but my god, the amount of tweaking and farming through the reeds you will end up doing prevents you from scaling to hundreds or thousands of conversation flows.

We use it by exception and only for two cases:

  • To speed up confirmations of data that the user just provided.
  • To correct important phonetic pronunciations (like company or C-level names).
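For the second case, here is a sketch of how phonetic corrections might be applied with SSML `<sub>` tags – the names and aliases in the table below are invented examples:

```python
import re

# Invented examples: words the TTS engine tends to mangle, mapped to how
# they should actually be spoken.
PRONUNCIATIONS = {
    "Nguyen": "win",
    "ASAP": "A sap",
}

def apply_pronunciations(text: str) -> str:
    """Wrap known problem words in <sub alias="..."> so the TTS engine
    speaks the alias instead of guessing at the spelling."""
    for word, alias in PRONUNCIATIONS.items():
        text = re.sub(
            rf"\b{re.escape(word)}\b",
            f'<sub alias="{alias}">{word}</sub>',
            text,
        )
    return text
```

Because the table is small and applied by exception, this stays maintainable in a way that hand-tuning every turn does not.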

I expect you could get good results from vibe coding the text prompts -> SSML to sound more lifelike and realistic. When I get some space with a junior dev, we might explore this (as it would scale), but given how good models are getting, we won't get to it before publishing (anyone who has tried this – hit me up on LinkedIn and let me know how it went).

Non-repetitive

This is the most common challenge, particularly if you "program" your agent's responses with direct text or a default saying (like the Lex three fingered salute). You might program it to say "thanks, I'll check that for you now" before a long process – that's good transparency – but it will get very grating if it happens three times in a conversation.

If you are programming the text, create an array of 10-12 possible phrases that convey the same meaning and pick one at random for the agent to say. Likewise, remove default fallbacks ("Sorry, I don't understand") and use the same 1-of-10 approach there.
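A minimal sketch of the 1-of-N approach, with a small guard against picking the same phrase twice in a row (the phrases themselves are placeholders):

```python
import random

# Placeholder phrases that all convey "bear with me while I check".
CHECKING_PHRASES = [
    "Thanks, I'll check that for you now.",
    "One moment while I look that up.",
    "Bear with me a second.",
    "Let me just pull that up.",
]

_last = None

def pick_phrase(phrases=CHECKING_PHRASES) -> str:
    """Pick a random phrase, never repeating the previous pick."""
    global _last
    choice = random.choice([p for p in phrases if p != _last])
    _last = choice
    return choice
```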

But repetition is also a problem between the customer and the agent if the agent is playing back exactly what the customer said. Occasionally in the design there will be a crucial item, such as an email address, that must be accurate – but only play back what the customer has said for those crucial bits and, if you can, only ask them to confirm what is at risk of being wrong.

Phonetics (is that Stephen with a "ph", or Steven with a "v"? They sound identical) is a big but complex topic that we cover in the advanced voice topics chapter.

But remember – don’t make your agent a (stochastic) parrot.

Personalized

Personalization helps. If you don't have a CRM integration, build a simple table so that you know how to greet someone when they call (if you have that data), or on the second time they call. Saying "Welcome back, Stephen, how can I help you today?" is far more personalized than "Hi, I am Alice, can I get your name please?"
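A sketch of that simple table, assuming caller ID is available (the numbers and names are invented):

```python
# Invented data: a caller-ID lookup that upgrades the greeting on repeat calls.
KNOWN_CALLERS = {"+61400000001": "Stephen"}

def greet(caller_id: str) -> str:
    """Personalized greeting if we recognise the number, generic otherwise."""
    name = KNOWN_CALLERS.get(caller_id)
    if name:
        return f"Welcome back, {name}, how can I help you today?"
    return "Hi, I am Alice, can I get your name please?"
```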

Brevity

There is a design principle, related to Occam's Razor and often attributed to Einstein, which states:

"Make it as simple as possible, but no simpler." And this is true not just of conversational flow design but also of the dialogue itself.

When I’m teaching junior voice engineers, I have recoined Occam’s Razor as Dangermouse’s Bludgeon: “Make it as short as humanly and empathically possible, but no shorter.”

Take the following example from a doctor booking agent I was asked to review recently:

Twice they confirmed the time of the appointment in the flow, each time they said:

“Tuesday September 30th 2025 at 11am in the morning”  

which takes 4 seconds to say, 8 seconds when done twice. Compare that to:

“Tomorrow at eleven” 0.68 seconds (or even “next Tuesday at 11”)

Review your flows and see where you can shorten what’s said without losing quality or meaning. Faster interactions are generally better ones.

Static Data & Relativity

This segues nicely into another robotic syndrome. To resolve the customer issue, agents need parameters to pass to the process or automated backend. These parameters usually involve static data like dates, addresses or telephone numbers.

Think about how you pronounce them or use them once they have been captured and avoid letting agents say the full value when a relative or shorter value would suffice.

Examples:

"We will send it to your office at Suite 11/246, 140 Pitt Street Sydney NSW 2000"

becomes

"We'll send it to your Sydney office"

Or:

"We will send a text to 0427 654123 when it's ready"

becomes

"We'll send you an SMS when it's ready"

Use actionable questions

Actionable questions steer the customer towards the action needed as the next step (our goal). Avoid open-ended questions.

"Yes, I can see there are issues"

becomes

"Yes, I can see the issue, would you like me to try and fix it with you?"

Provide Evidence

We will explore this in the next chapter, but I’ll raise it here. Customers want evidence/proof that you will fulfil their request exactly how they have asked you to. This is even more true in an AI agent scenario.

Early conversation flow developers feel that they have to confirm everything back to the customer in the call. That isn't true, and it leads to robotic playback of names, emails and so on.

You can use out of band evidence, like an SMS or an email to provide that evidence and also the confirmation that you have captured everything correctly.  This can take a lot of the pain and time out of the call.

Remember: don't over-confirm in the conversation flow if you can confirm it outside the flow (e.g. by SMS).

Informal/colloquial instead of formal

Do you remember the TV show "Hawaii Five Zero"? Nope, neither do I – everyone called it, colloquially, "Hawaii Five-O". When pronouncing your own mobile number, do you use the formal "zero four one two", or do you say "oh four one two"? Finally, do you notice how you say the informal version much faster?

AI agents announcing numbers default to formal and slow (slow provides clarity, which is good – but not if it's confirming information the customer just provided). We don't advocate the use of SSML in many cases, but here – particularly when you are confirming something the user already knows (e.g. they just provided their address or mobile number) – it is a good use of SSML to speed up the response by 20-33%. Just remember to include in your prompt: "If you see SSML, obey it but do not pronounce the tags", or you'll get some really confused customers.
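A sketch of wrapping just the confirmed value in a faster prosody rate – the 120% figure is an illustrative choice within the 20-33% speed-up range above:

```python
# Wrap only the value the customer just gave us in a faster prosody rate,
# leaving the rest of the sentence at normal speed.
def speed_up_confirmation(value: str, rate: str = "120%") -> str:
    return f'<prosody rate="{rate}">{value}</prosody>'
```

Used inline, e.g. `"I'll text " + speed_up_confirmation("oh four one two, triple three") + " shortly."`, only the number plays faster.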

Support pre-filling

This is another important topic, and crucial if customers are going to engage repeatedly with a particular flow. Pre-filling is where the customer knows the turns, so they will try to answer additional turns/questions in their first response.

So as a design principle, if you have an architecture that supports it, evaluate the user's first response against all subsequent questions. This creates what we call internally the opportunity for a "one-shot", where the customer can answer all of the questions in their first response.
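A naive sketch of the principle: test the first utterance against every slot, and only ask about the gaps. A real system would use an LLM or NLU extractor; the regexes below are crude stand-ins:

```python
import re

# Crude stand-in extractors for each slot the flow would otherwise ask about.
SLOT_PATTERNS = {
    "name": re.compile(r"\b(?:Mr|Mrs|Ms)\s+\w+"),
    "day": re.compile(r"\b(?:Mon|Tues|Wednes|Thurs|Fri|Satur|Sun)day\b", re.I),
    "time": re.compile(r"\b\d{1,2}\s?(?:am|pm)\b", re.I),
}

def prefill(utterance: str) -> dict:
    """Fill every slot we can from the customer's first utterance."""
    filled = {}
    for slot, pattern in SLOT_PATTERNS.items():
        match = pattern.search(utterance)
        if match:
            filled[slot] = match.group()
    return filled

def remaining_questions(utterance: str) -> list:
    """Only these slots still need their own turn in the conversation."""
    return [slot for slot in SLOT_PATTERNS if slot not in prefill(utterance)]
```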

This also prevents repetition (see the first point). For one customer who had 14 (!!!!) questions that had to be answered in a flow (driven by regulatory/safety requirements), we managed to prefill it with a one-shot.

To explain: they had customers who had to use this flow 3 or 4 times a week, and we noticed the human behaviour of pre-filling. They loved us for refactoring to support it, because cutting it from 14 questions to 3 or 4 meant they could complete the flow in a minute rather than 6-8 (yes, regulatory requirements made us break our own internal 5-minute rule).

Non-pre-fill example:

Agent: Hi and welcome to the Prawn restaurant, how can I help you today?
Customer: Hi, it's Mr Samuels, I would like to book a table for Thursday 7pm.
Agent: Ok, and what day and time would you like to book?
Customer: Thursday 7pm.
Agent: And do you want to sit inside or outside?
Customer: Inside is fine.
Agent: Great, what name shall I put it under?
Customer: Mr Samuels.
Agent: Ok, that's all booked, see you Thursday at 7, I'll send an SMS to confirm the details.

Pre-fill example:

Agent: Hi and welcome to the Prawn restaurant, how can I help you today?
Customer: Hi, it's Mr Samuels, I would like to book a table for Thursday 7pm.
Agent: Got it, do you want to sit inside or outside?
Customer: Inside is fine.
Agent: Ok, that's all booked Mr Samuels, see you Thursday at 7, I'll SMS you the confirmation.

Table 3.1 Example of pre-filling vs non pre-filling conversations

Notes:

  • In the example, we chose not to pre-fill the inside/outside question, but if he booked regularly, Mr Samuels would start including that in his greeting.
  • Also note the closing statement in the pre-fill example: the name is an important part of the confirmation, so we add it there because it isn't a repeat of an answer he just gave (as it would be in the non-pre-fill example).
  • In truth, this is a terrible concocted example, because we should recognize Mr Samuels from his telephone number and greet him personally.
  • Notice the use of an SMS as evidence too; with that, we could potentially shorten the final statements, because we are providing the confirmation and evidence separately from the call.
  • Finally, both conversations repeat the exact same prompts (which breaks our first, non-repetitive guideline) – I did this deliberately to make the comparison of the two conversations easier.

Fortunately, with the move from turn-based dialog platforms like Lex and Dialogflow to AI agents, this problem largely goes away – but depending on how you manage your workflow, keep an eye out to ensure you don't prevent customers from pre-filling.

Serendipitous Education / dialog coaching

"Serendipitous education" is a term taken from Google's voice playbook, which we mentioned in Chapter 1; we call it "dialog coaching". We use it all the time when we have to capture multiple fields: we try to get the user to say as many of them as is reasonable/sensible in as few questions as possible.

If we go back to the pre-filling example, dialog coaching would mean changing the prompt after the "Hi and welcome to the Prawn restaurant" greeting from:

“How can I help you today?” to:

"If you can give me your name and a date and time, I can reserve you a table"

Note: We always separate the agent that greets the caller from the agents that fulfil their request(s), in what we referred to in Chapter 1 as the AI concierge model. This separates the "greeting" and working out the customer "intent" from the meat of getting things done (or tables booked, in this instance).

This is so that when you engage your Table Booking agent, it has one single responsibility to worry about – booking tables – not handling anything else.

Unfortunately, the example above isn’t using the AI Concierge model so the English sounds weird, particularly if all you want to do is speak to Finance – but it shows the principle of dialog coaching.

You can also provide this education at the end of the call, if you are careful (see the later guidance about direct, robotic instructions).

Computer says no

People, particularly customers, don't like to be told "No!" – it's human nature. There was a comedy show called Little Britain that regularly ran sketches on this. Regrettably, software development is very logical: binary, even – black and white, true or false. So the way we design, for example, an API to book an appointment can return an availability of true or false.

This is clean and simple; however, in customer service we need to be less logical and more emotional. When looking at your process, if your backend has a potential "computer says no" moment, try to change it to a "No, but we have these alternatives" if you can. Building these smarts into the backend will also remove a lot of back and forth in your front end – assuming you control the backend.
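A sketch of turning a boolean availability check into a "No, but" response – the function shape and schedule data here are invented for illustration:

```python
from datetime import datetime, timedelta

def check_availability(requested: datetime, booked: set) -> dict:
    """Instead of returning a bare False, return the nearest free alternatives
    so the agent can say "No, but how about...?"."""
    if requested not in booked:
        return {"available": True, "alternatives": []}
    alternatives = []
    for offset_hours in (1, -1, 2):
        slot = requested + timedelta(hours=offset_hours)
        if slot not in booked:
            alternatives.append(slot)
    return {"available": False, "alternatives": alternatives[:2]}
```

Because the alternatives come back in one response, the voice agent avoids a whole extra round trip of "that's taken" / "what about...?" turns.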

When we get to putting this all together into an example, we will show this pattern in practice.

Empathy

How to achieve empathy in human-computer interactions has been the subject of many PhD theses and research papers, and is some of the most fascinating reading from my years of learning about voice.

At its core, introducing empathy into the conversation improves perceptions of helpfulness and trustworthiness; I've linked my preferred paper in the table caption and the references section.

A thorough exploration of this topic would take months, but here are the top 10 theories that have been shown to improve perceived empathy in AI conversations:

Implementing empathy:

The guidelines in the following table should be in your mind whenever you are putting together a multi-turn conversation; you should also review them against any direct dialogue that you instruct your bot to say.

But you can also include (somewhat wordily) empathy guidelines inside your prompt. Telling the model to favour the first person plural, use modal verbs, favour collective reasoning over direct instructions, and prefer the present tense over the past will all lead to a more satisfying customer experience.

As with all prompting, it's trial and error on an LLM-by-LLM basis as to which particular prompt produces the best customer experience, but until we get a DSL for prompting – it is what it is 😊
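One way to fold those guidelines into a prompt is sketched below; the wording is my own paraphrase of the guidelines above, and, as noted, will need per-model trial and error:

```python
# My own paraphrase of the empathy guidelines, as a reusable prompt section.
EMPATHY_GUIDELINES = """\
When speaking to the customer:
- Prefer first person plural ("we can solve this together") over third person.
- Use personal pronouns ("your deposit"), and the present tense over the past.
- Use modal verbs (please, may, would, should); avoid exclamations.
- Acknowledge what the customer tells you ("this is helpful").
- Reason jointly ("let us think this through") rather than stating facts flatly.
"""

def build_system_prompt(base: str) -> str:
    """Append the empathy section to any base system prompt."""
    return base + "\n\n" + EMPATHY_GUIDELINES
```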

Examples of Empathy Display in language of conversational AI:

| Theory | Non-empathic example | Empathic example |
| --- | --- | --- |
| Person form | Third person: "This problem can be solved" | First person plural or second person: "We can solve this together" |
| Pronouns | No personal pronouns: "The deposit" | Personal pronouns with active tense and transitivity: "Your deposit" |
| Tense | Past tense | Present tense |
| Exclamations | No modal verbs (no please, may, would, should) | No exclamations, but plenty of modal verbs |
| Stimulating dialogue | Direct neutral instructions: "Click the button with Continue" | Warm, joint dialogue: "Let us see", "Shall we", "Could you please share" |
| Acknowledging | No acknowledgment | Acknowledges dialogue interactions: "Thank you for telling me that", "This is helpful" |
| Collective reasoning | Presents facts, results or conclusions factually: "Based on case law" | Language that focuses on thinking together jointly: "Let us think this through", "The way I understand our situation is that…" |
| Imperative statements | Neutral statements/instructions: "Click Proceed" | Empathetic imperative (do + infinitive): "Please do [verb]" |
| Showing understanding | No interim questioning | Interim questioning about emotional state; express understanding and try to adopt the customer's perspective and emotional state |
| Caring statements | No caring statements | Affective statements of care |

Table 3.2 Empathy in Conversation [from the excellent paper: The Impact of Empathy Display in Language of Conversational AI]
