A guide to building AI apps and artefacts

Chapter 1 - Adding speaking to programs

Ken Kahn, University of Oxford

Browser compatibility

This guide includes many interactive elements that currently only run well in the Chrome browser. This chapter relies upon there being speakers or headphones connected to your computer. There is a troubleshooting guide that should be consulted if problems are encountered.

Introduction

The importance of artificial intelligence in society, the workplace, and the economy is rapidly increasing. It is becoming an important part of many other technologies. AI conversational agents are becoming common in phones and PCs (Siri, Cortana, Google Assistant). Driverless vehicles are coming soon. Finance, medicine, and many other fields are relying more and more on AI. Even toys are beginning to embed AI.

Is there something to learn about intelligence in general by studying AI? AI researchers are attempting to give computers the abilities to perceive, to solve problems, to plan, and to reason. Perhaps this will shed some light on intelligence beyond what studying human and animal psychology can.

AI raises many questions:

Ways of learning AI

Broadly there are three approaches:

  1. Readings, lectures, and discussions about AI's history, ideas, and technologies
  2. Readings, lectures, and discussions about social issues impacted by AI
  3. Supporting hands-on experiences creating AI apps and artefacts

While this guide touches on 1 and 2, its focus is on 3. While some aspects of building AI applications are best suited for those with PhDs in the field, there are many projects that even young school students can do. If the power of speech and image recognition or the possibilities of machine learning are given easy programming interfaces, then even beginners can use them in their creations.

Speech synthesis

Let's start by exploring speech synthesis. Is the ability to transform text into spoken speech really an example of AI? Humans need to spend a long time learning to read (out loud). But reading involves character recognition and comprehension. When a computer produces speech none of that is involved. Is generating speech an aspect of intelligent behaviour or is it just like a parrot speaking? Note that Alex the parrot clearly did more than "parroting" speech. Maybe speech synthesis is AI if the program uses AI techniques to generate the output. The article New AI Tech Can Mimic Any Voice assumes this is AI. What do you think?

Before discussing what is technically involved in synthesising speech, let's look at how it can be used. Let's start with the simplest program block for speaking. Click on it to try it out. (Currently these blocks only work well in Chrome. Note that the Raspberry Pi has the open source version of Chrome, called Chromium, which lacks built-in voices. The Mary TTS (text-to-speech) blocks below work in any browser.)

The set default language command sets what language is to be used if none is specified (see the full featured speak command). It can be the language's name in the language itself or in English. Or you can use a language code followed by the dialect code, e.g. en-GB for English as spoken in Great Britain. Language codes should be one of the IETF language tags.
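To make the accepted inputs concrete, here is a minimal Python sketch of how a language name or code might be turned into an IETF language tag. It is only an illustration: the lookup table, the choice of default, and the function name resolve_language are hypothetical and are not taken from the blocks in this guide.

```python
# Hypothetical lookup table: language names (in English or in the language
# itself) mapped to IETF language tags. Only a few entries, for illustration.
LANGUAGE_TAGS = {
    "english": "en-GB",
    "french": "fr-FR",
    "français": "fr-FR",
    "german": "de-DE",
    "deutsch": "de-DE",
    "spanish": "es-ES",
    "español": "es-ES",
}


def resolve_language(spec, default="en-GB"):
    """Return an IETF language tag for a language name or code."""
    text = spec.strip()
    if not text:
        return default  # nothing specified, so fall back to the default
    known = LANGUAGE_TAGS.get(text.lower())
    if known is not None:
        return known
    # Otherwise treat the input as a code such as "en" or "en-gb" and
    # normalise the case: lowercase language, uppercase two-letter region.
    parts = text.replace("_", "-").split("-")
    language = parts[0].lower()
    if len(parts) > 1 and len(parts[1]) == 2:
        return f"{language}-{parts[1].upper()}"
    return language
```

With this sketch, "Deutsch" and "German" both resolve to de-DE, while a code such as "en-gb" is simply tidied up to en-GB.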

Speaking sentences is more than speaking a series of words

Speech synthesisers do more than just say each word in a sentence. They attempt to produce a natural prosody, i.e., the intonation, tone, stress, and rhythm. They speak questions differently from declarative sentences. This is very different from old speech synthesisers such as the one Stephen Hawking used.

Speaking things other than words

Even for a human, speaking numbers and punctuation signs can be a bit challenging. Is this a sign of intelligence?

Exercise. Try editing it to see how different numbers are spoken and how it handles special signs.

Problems combining speech and sound effects

Suppose one wanted the computer to speak, be interrupted by a doorbell, and then respond accordingly. Try clicking this and you'll hear there is a problem:

The problem is that the sound effect begins at the same time as the first sentence. This is because the Speak block tells the browser to begin speaking and then immediately continues with the next action. What we want is to have the doorbell sound played after the first utterance is finished. For this we need a slightly more complex block that accepts actions to do after speaking:

Being able to express what should happen when speaking finishes is a generally useful ability. For example, the program may speak some explanation and then do something on the screen. In many of the examples in the next chapter the programs listen for speech only after the system finishes speaking.

Code blocks that should run when some computation finishes are an instance of an advanced computer science concept called continuations. These are used to support asynchronous computations. The complexity introduced by continuations need not make things hard for students. They can easily place blocks in the Speak block without understanding the underlying computer science.
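For readers who would like to see the idea outside of blocks, here is a small Python sketch of a continuation. It does not produce any sound: FakeSpeaker is a hypothetical stand-in that only records the order of events, and the sentences and the then parameter are made up for illustration.

```python
from collections import deque


class FakeSpeaker:
    """Stand-in for a speech engine: it records events instead of making sound."""

    def __init__(self):
        self.log = []          # events in the order they happen
        self._queue = deque()  # utterances that have not finished yet

    def speak(self, text, then=None):
        # Returns immediately, like the simple Speak block.
        self.log.append(f"speak: {text}")
        self._queue.append((text, then))

    def play_sound(self, name):
        self.log.append(f"sound: {name}")

    def run(self):
        # Pretend each queued utterance finishes, then run its continuation.
        while self._queue:
            text, then = self._queue.popleft()
            self.log.append(f"finished: {text}")
            if then is not None:
                then()


def doorbell_too_early():
    speaker = FakeSpeaker()
    speaker.speak("Hello, how are you today?")
    speaker.play_sound("doorbell")  # requested before the greeting has finished
    speaker.speak("Who is at the door?")
    speaker.run()
    return speaker.log


def doorbell_after_speech():
    speaker = FakeSpeaker()

    def after_greeting():
        # This is the continuation: it runs only once the greeting is done.
        speaker.play_sound("doorbell")
        speaker.speak("Who is at the door?")

    speaker.speak("Hello, how are you today?", then=after_greeting)
    speaker.run()
    return speaker.log
```

In the first function the doorbell is logged before the greeting finishes, which mirrors the problem described above. In the second, the doorbell and the reply are placed inside the continuation, so they happen only after the greeting has finished.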

What is speech synthesis good for?

Speech synthesis is a good way for software to communicate with people who are blind or visually impaired. For devices such as a talking clock or a robot without a display there are few alternatives. The same is true when the user's eyes need to attend to something else (e.g., a surgeon during an operation or a pilot flying a plane). Speech synthesis is often the best way to communicate with children or adults who have yet to learn to read. It also gives those who physically cannot speak (e.g., Stephen Hawking) a way to communicate. Many argue that a conversation is often a more natural, friendly, and pleasant interface to devices than displays, keyboards, mice, buttons, and touch sensors.

Stephen Hawking was able to communicate thanks to speech synthesis

Can speech synthesis be misused?

It is hard to imagine dangers with today's speech synthesis technology. As it gets better it may enable the automation of tasks that voice actors perform today. A more serious worry is that the next generation of speech synthesis may be able to fool people into thinking that someone said something they didn't. This is especially worrisome in combination with technology that can alter videos to change emotional expressions and lip synch.

How does this work?

Modern speech synthesisers (also called "text-to-speech engines" or "TTS engines") start by preprocessing the text. Numbers, abbreviations, dates, and special characters are turned into words. For example, "42" becomes "forty two" (assuming the language is set to English). Next, words are turned into phonemes. A phoneme is the unit of sound that words are made of. A phoneme is described phonetically so later stages can pronounce the words. For example, /ˈfəʊniːm/ is how the word "phoneme" is pronounced. This can be done by a dictionary lookup or by using pronunciation rules. Finally, sounds are generated as a sequence of pitch and volume changes (typically many thousands per second). There are three approaches to doing this last step: concatenative, where recorded bits of speech are "glued" together; formant, where each phoneme is synthesised; and articulatory, where human tongues and vocal cords are simulated.
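The first two stages can be sketched in a few lines of Python. This is a toy illustration, not how any particular speech engine is implemented: it only handles English numbers from 0 to 999 and two special characters, its pronunciation dictionary holds just the one example given above, and all the names in it are hypothetical.

```python
import re

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]

# Hypothetical, deliberately tiny tables.
SYMBOLS = {"%": " percent", "&": "and"}
PRONUNCIATIONS = {"phoneme": "/ˈfəʊniːm/"}


def number_to_words(n):
    """Spell out an integer from 0 to 999 in English."""
    if not 0 <= n <= 999:
        raise ValueError("this sketch only handles 0 to 999")
    if n < 20:
        return ONES[n]
    if n < 100:
        tens, ones = divmod(n, 10)
        return TENS[tens] if ones == 0 else f"{TENS[tens]} {ONES[ones]}"
    hundreds, rest = divmod(n, 100)
    words = f"{ONES[hundreds]} hundred"
    return words if rest == 0 else f"{words} and {number_to_words(rest)}"


def normalise(text):
    """Stage 1: turn special characters and short numbers into words."""
    for symbol, word in SYMBOLS.items():
        text = text.replace(symbol, word)
    # Only stand-alone runs of one to three digits are replaced;
    # longer numbers are left untouched by this sketch.
    return re.sub(r"\b\d{1,3}\b",
                  lambda match: number_to_words(int(match.group())), text)


def to_phonemes(word):
    """Stage 2: dictionary lookup. None means the word is not in the
    dictionary, so a real engine would fall back on pronunciation rules."""
    return PRONUNCIATIONS.get(word.lower())
```

For instance, normalise("I am 42") gives "I am forty two". A real engine needs far larger tables and also has to deal with dates, abbreviations, and words that are missing from its dictionary.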

More details about how speech synthesis works can be found in the Additional resources section at the bottom of this page.

Controlling speech synthesis

A person can speak slowly or quickly, in a low or high pitch, quietly or loudly, and more. And different people have different voices and accents. To mirror these abilities a more complex block is available. In the following block a parameter value of 1 means the normal pitch, rate, or volume. Fractional values correspond to low pitch, slow rate, or low volume. How high the values can be for the opposite effect depends upon the browser and the voice used. Note that depending upon the voice and browser some of these parameters are ignored.
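As a rough illustration of these parameters, the Python sketch below keeps a set of speech settings within fixed limits. The limits are the ranges given in the Web Speech API specification (pitch 0 to 2, rate 0.1 to 10, volume 0 to 1); whether the blocks in this guide apply the same limits is an assumption, and the names SpeechSettings and clamp are hypothetical.

```python
from dataclasses import dataclass

# Ranges from the Web Speech API specification. Particular browsers and
# voices may support less, or may ignore a parameter entirely.
LIMITS = {"pitch": (0.0, 2.0), "rate": (0.1, 10.0), "volume": (0.0, 1.0)}


def clamp(value, low, high):
    """Force value into the range low..high."""
    return max(low, min(high, value))


@dataclass
class SpeechSettings:
    pitch: float = 1.0   # 1 is the normal pitch
    rate: float = 1.0    # 1 is the normal speed
    volume: float = 1.0  # 1 is the normal (and loudest) volume

    def __post_init__(self):
        # Replace any out-of-range value with the nearest allowed one.
        for name, (low, high) in LIMITS.items():
            setattr(self, name, clamp(getattr(self, name), low, high))
```

For example, SpeechSettings(pitch=5, rate=0.5) ends up with a pitch of 2.0 and a rate of 0.5.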

Exercise. A good strategy for trying things out is to vary one thing at a time. Try changing the rate to a number less than 1 and then greater than 1. Do the same for the pitch. Try different voices. You can see the list of voices available in your browser using the Get voice names block. Some of them are intended to speak other languages.

Regarding the 'language' option, unfortunately many browsers currently ignore it. When a browser accepts it the argument should be one of the IETF language tags. It describes the language and dialect, e.g. en-GB. This library, however, will accept the name of the language in itself or in English.

Fortunately in some browsers (e.g., Chrome) some voices are associated with a language. When a non-English voice is given English to speak it typically speaks it with an accent. The selection of voices available depends upon the browser and the operating system. Click get voice names to see the list of voices and use the voice number in the Speak command. You can also use the voice that matches block to find a voice that matches the search terms.
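Finding a voice from search terms can be pictured with the Python sketch below. It is a guess at the general idea rather than a description of the actual block: the list of voice names is sample data, counting voices from 1 is an assumption, and find_voice is a hypothetical name.

```python
# Sample data only; the real list depends on the browser and operating system.
SAMPLE_VOICES = [
    "Google US English",
    "Google UK English Female",
    "Google UK English Male",
    "Google français",
]


def find_voice(voices, *terms):
    """Return the number (counting from 1) of the first voice whose name
    contains every search term, ignoring case, or None if none matches."""
    wanted = [term.lower() for term in terms]
    for number, name in enumerate(voices, start=1):
        lowered = name.lower()
        if all(term in lowered for term in wanted):
            return number
    return None
```

Here find_voice(SAMPLE_VOICES, "UK", "Female") returns 2. Note a weakness of simple matching like this: the search term "male" is also found inside "Female", so find_voice(SAMPLE_VOICES, "UK", "male") returns 2 as well.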

A voice is more than just a voice. Listen to the two ways of reading dates.

If your browser has too few available voices or you want to try a different speech engine

Some browsers and some operating systems have very few (or no) voices. An alternative speech block uses a server in Germany that has its own set of voices.

Heteronyms cause problems

Heteronyms are words that mean and sound different but are spelled the same. The English word "read" can sound like "reed" or "red" depending upon the tense. The verb "lead" sounds different from "lead" the metal. Some voices are better than others at using the correct pronunciation. Explore how different voices deal with "read" in the following. Try editing the sentences to see how they deal with "lead" and other words. Try this in another language.

Exercise. Can you think of any other words that are pronounced in more than one way? See if the voices can figure out the right one from the surrounding words.

Speaking non-words

It is interesting to explore how different voices deal with non-words such as the growl "grrrrrrrrrrrrr". Listen to how it sounds for each voice and then explore other non-words.

A sample program using speech synthesis

The following sample speech synthesis program uses random settings for rate, pitch, and voice. When one clicks on the picture of some numbers a random number is spoken with a random voice (and hence a random language) as well as a random pitch and rate. Numbers are well-suited for this since numbers are spoken in the language of the voice. The parrot repeats what it hears (see the next chapter). Hearing your utterances repeated in a strange voice with a random pitch and rate can be entertaining, as can hearing your speech repeated in a foreign accent. As with all sample programs the blocks implementing it can be easily found. Click on to see the blocks behind a program. If you click on the Parrot or Numbers sprite in the lower right you will see these sample programs. There is a full window version.
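The random choices made by the numbers program can be sketched in Python as follows. This is not the program's actual implementation: the ranges for pitch, rate, and the spoken number are assumptions chosen for illustration, and the function names are hypothetical.

```python
import random


def random_settings(voice_count, rng=random):
    """Pick a random voice number, pitch, and rate."""
    return {
        "voice": rng.randint(1, voice_count),         # voices counted from 1
        "pitch": round(rng.uniform(0.5, 2.0), 2),     # assumed range
        "rate": round(rng.uniform(0.5, 2.0), 2),      # assumed range
    }


def random_number_to_speak(rng=random):
    """Pick the text to speak: a random whole number, as a string."""
    return str(rng.randint(0, 1000))                  # assumed range
```

Passing a seeded generator, e.g. random_settings(5, random.Random(1)), makes the choices repeatable, which is handy when testing.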

Possible project ideas using speech synthesis

Here are some ideas for projects using only speech synthesis. See the next chapter for ideas combining speech synthesis and recognition.

Is there even more to speech synthesis?

There are many more possibilities to control speech. A voice can sound like a robot, like whispering, like a crowd in a stadium or a chorus. You can experiment with these kinds of speech effects at the MARY TTS (text-to-speech) site by clicking on the 'Show Audio Effects' button.

Speech can be emotional; the speaker can sound angry, happy, sad, and so on. One way to experiment with this is to change the input type from TEXT to EMOTIONML at the MARY TTS site. There is a standard called Speech Synthesis Markup Language that provides great control over how the speech is generated. However, it has yet to be implemented in the voices one commonly finds available in browsers.

Listen to these samples of generated speech from the latest research from Google.

A very impressive use of controlled speech is the Giorgio Cam experiment by Google researchers. It combines computer vision and controlled speech to rap about what is in front of a webcam.

Additional resources

A complete description of the speech synthesis API that web browsers support.

A history of speech synthesis.

A long explanation of how speech synthesis is done by Explain that Stuff.

The Wikipedia entry is good.

Suggestions for using this guide

An excellent way to learn about new programming constructs such as the blocks in this guide is to tinker with them. Explore what each block can and can't do. See what effects the parameters have. Then try them out in your own projects perhaps starting with the simple blocks.

Learn about speech recognition

Go to the next chapter on speech recognition