Text-to-speech Programs

I finally got around to shopping for a text-to-speech program for
dictation practice and audiobook making. Here’s my experience, for
anyone interested:

First, those who’ve tried Windows XP’s built-in TTS program (“Narrator”)
might be pleased to hear that third-party products sound incomparably
better. I looked at at least a dozen commercial TTS developers, about
half of which offer desktop products. I narrowed those down by three
criteria:

1) High quality sound
2) Output can be directed to a sound (mp3) file
3) Reading speed can be adjusted in WPM (many products only provide
presets like “fast”, “medium”, “slow”. Note, I didn’t find any that
could be adjusted in “1.4-syllables-per-minute”!)

This left just two candidates: AT&T’s “Natural Voices”
(www.naturalvoices.att.com), and Cepstral (www.cepstral.com).
The internet consensus seemed to be that AT&T’s are far and away
the best sounding. I wondered if that consensus is just due to the
product’s high profile; their 22KHz voices are unarguably good, but in
some aspects, like phrasing and dynamic range, I thought IBM’s newer
products, and Loquendo’s, were better (alas, neither of those offer
end-user desktop packages).

Cepstral’s various voices are definitely in the same class as AT&T’s;
I think their phrasing is even better. Cepstral offers versions not
only for Windows, but also OSX, Linux and Solaris.

AT&T’s 16KHz “Natural Voices” can be bought at www.nextuptech.com for
$35USD + $19.95USD for an mp3 file-creating extension. I’m not sure
where the 22KHz versions can be purchased. www.cepstral.com sells
their selection of voices (with file-output capability built-in) for
$29.99USD each. Both stores offer online demos, and Cepstral offers a
limited free-trial version.

I opted for Cepstral’s 22KHz “Callie” voice for Linux, and I’m really
happy with it. The only reservations I’d note are about computer voices
in general, and for transcription practice, those aren’t problems.
Pronunciation seems about 99% accurate; most of its errors are on
homonyms, present-/past-tense ambiguity, etc—occasionally distracting,
but almost negligible. Cepstral’s WPM adjustment seems to affect the
words and not the breaks between them, which sounds unnatural as speeds
drop below 80WPM; maybe that’s standard behaviour for such programs?
In any case, it doesn’t make it hard to understand, and there are ways
to change that behaviour. It’s great to be able to take material I’m
actually interested in hearing, and to be able to bump the speed up at
my own pace.
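
One possible way to slow dictation down without stretching the words is to leave the voice at a comfortable speaking rate and pad the gaps between words instead. The sketch below only does the arithmetic and emits generic SSML `<break>` tags. The function names, the 100 WPM default, and the assumption that the voice really speaks at `spoken_wpm` are all mine for illustration; this is not necessarily how Cepstral expects it to be done.

```python
from xml.sax.saxutils import escape


def pause_per_word_ms(target_wpm, spoken_wpm):
    """Silence (ms) to add after each word so that words spoken at
    spoken_wpm average out to target_wpm overall."""
    if target_wpm <= 0 or spoken_wpm <= 0:
        raise ValueError("rates must be positive")
    if target_wpm >= spoken_wpm:
        return 0  # already slow enough, no padding needed
    # Time budget per word at the target rate, minus time spent speaking it.
    return round(60000 / target_wpm - 60000 / spoken_wpm)


def slow_dictation_ssml(text, target_wpm, spoken_wpm=100):
    """Wrap text in SSML with a fixed break between words (hypothetical helper)."""
    pause = pause_per_word_ms(target_wpm, spoken_wpm)
    words = [escape(w) for w in text.split()]
    if pause == 0:
        joined = " ".join(words)
    else:
        joined = f' <break time="{pause}ms"/> '.join(words)
    return f"<speak>{joined}</speak>"
```

For example, dictating at 60 WPM with the voice speaking at 120 WPM leaves a 500 ms gap after each word, since each word gets a 1000 ms slot and takes about 500 ms to say.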

The program has been even more effective than I expected for audiobook
making. I’ve complimented Cepstral’s phrasing ability, but lest you
expect Laurence Olivier, my wife says they all sound like they’re reading
a shopping list. I don’t notice that after about ten minutes, and the
voice has so far sustained my comprehension all the way through two large
classic novels, a couple of books of the Bible, and some dense philosophy
(not that I have that much free time!—I just drive a lot).

Creating an audiobook from some www.gutenberg.org plaintext is a snap.
Creating a nicely paced audiobook with proper pauses on paragraphs,
em-dashes, braces, emphases on exclamation points, and organised into
chapter-per-mp3-file is also easy—in unix. I’m not sure how you’d
do that much fancy chapter breaking, SSML markup and mp3 making under
Windows except by hand.
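
For anyone curious what the chapter breaking and markup step might look like, here is a minimal sketch in Python rather than shell. It assumes chapter headings are lines beginning with "CHAPTER" (which varies from one Gutenberg text to another), that paragraphs are separated by blank lines, and that dashes appear as "--" or as an em-dash character. The function names and pause lengths are made up for illustration, and the SSML is a simplified subset (`<speak>`, `<p>`, `<break>`); feeding each chapter to the TTS program and encoding the mp3 is left out.

```python
import re
from xml.sax.saxutils import escape

# Assumed heading style: a line starting with "CHAPTER" plus a number or numeral.
CHAPTER_RE = re.compile(r"^CHAPTER[ \t]+\S+.*$", re.MULTILINE)


def split_chapters(text):
    """Split text at chapter headings; anything before the first heading is dropped."""
    starts = [m.start() for m in CHAPTER_RE.finditer(text)]
    if not starts:
        return [text.strip()]
    ends = starts[1:] + [len(text)]
    return [text[s:e].strip() for s, e in zip(starts, ends)]


def to_ssml(chapter, para_pause_ms=800, dash_pause_ms=300):
    """Mark up one chapter: a pause after each paragraph and at each dash."""
    paragraphs = [p for p in re.split(r"\n\s*\n", chapter) if p.strip()]
    parts = []
    for p in paragraphs:
        # Undo hard line wrapping, then escape &, < and > for XML.
        body = escape(" ".join(p.split()))
        # Replace "--" or an em-dash with a short break.
        body = re.sub(r"\s*(?:--|\u2014)\s*",
                      f' <break time="{dash_pause_ms}ms"/> ', body)
        parts.append(f'<p>{body}</p><break time="{para_pause_ms}ms"/>')
    return "<speak>" + "\n".join(parts) + "</speak>"
```

Each string returned by `split_chapters` can be passed through `to_ssml` and written to its own file, one per chapter, ready for whatever TTS command produces the audio.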

In sum, I’d recommend this program to anyone looking for motivation to
practice transcription.

(by routine-sibling
for everyone)

2 comments
  1. It's interesting that you are interested in text-to-speech also. You might want to look at NextUp's forum (http://www.nextup.com/phpBB2/index.php). There are quite a few comments on various voices there. I bought their TextAloud product and a number of voices to go with it. It runs only on Windows, though, not on Linux.

    I'm surprised that you found AT&T voices were rated so highly. AT&T seems to have quit development on voices, and the current voices are several years old. My impression was that the AT&T voices were sometimes choppy and made errors in pronunciation. They may have been the state of the art at one time, but I think better choices are now available.

    From an accuracy standpoint, the best American English voices I think are from Neospeech or Nuance. The Neospeech voices I think have an advantage in that they have a pronunciation editor that can more accurately specify the pronunciation than the TextAloud editor.

    On the other hand, the most pleasant voice I think is Heather from the Acapela Group. She sounds the same as Cepstral's Callie, since apparently the same voice talent was used for both, but listening to the sample at NextUp's site, I thought Heather sounded smoother than Callie. Heather's pronunciation is not as accurate as the Neospeech or Nuance voices, but I just like her voice better. Heather also comes with her own pronunciation editor, although it seems that some words can't be made satisfactory no matter what you do. By the way, perhaps I'm pickier, but I think all the voices make way too many pronunciation errors.

    I also agree with your wife–they do sound like they're reading a shopping list. They sound OK when they're reading the news, but it is really apparent when they are reading fiction or anything that requires emotion.

  2. > perhaps I'm pickier, but I think all the voices make way too many pronunciation errors…

    Cepstral's voices have a pronunciation editor, too, but actually, I try hard to avoid getting into that, and deliberately settle with the limitations and imperfections. Else, once I begin tinker…

    > I also agree with your wife–they do sound like they're reading a shopping list.

    Yes. Expository writing, essays and reports are all right, but listening to a TTS reading of a novel demands a lot of patience, maybe more than it's worth if you can afford to buy the commercial audiobook edition, or have the time to read it.

    In fact it seems more like reading than listening. The program is *consistent* at least; so one can "interpret" its idiosyncrasies. And that's often what I find myself doing since, as you say, the emotion isn't there to cue, and odd pronunciations or vocalizations are ambiguous. I often have to hear out a whole sentence or some bigger context to know what's meant—like I do when I read Gregg!

    For me it's a more-than-welcome compromise for custom transcription practice material.
