Text To Voice Technology On WordPress
Text to speech online services can turn typed text into a computerised voice.
If you’ve watched any of the videos on this site, or seen the audio transcripts that appear at the top of every post, you can’t have failed to notice that they are all (apart from this one) voiced by Joanna at Amazon Polly.
Amazon Polly is a text to speech engine that can automatically add spoken narration to content. Amazon Translate is an automatic text translation engine. They are both part of Amazon Web Services.
By using them together, you can not only translate text content into a number of other languages, but also voice videos in those languages.
In my case, I am only applying Amazon Translate to blog post content. For now, my video voice-overs and captions are only available in English.
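For readers who want to see how the two services fit together, here is a minimal sketch in Python. It assumes the boto3 AWS SDK is installed and AWS credentials are configured. The function names, the output path and the choice of voices are illustrative only, not how this site is actually wired up.

```python
"""Sketch: translate some text with Amazon Translate, then voice it with Amazon Polly."""


def build_translate_request(text, target_language, source_language="en"):
    """Arguments for the Amazon Translate `translate_text` call."""
    return {
        "Text": text,
        "SourceLanguageCode": source_language,
        "TargetLanguageCode": target_language,
    }


def build_polly_request(text, voice_id="Joanna"):
    """Arguments for the Amazon Polly `synthesize_speech` call, asking for an mp3."""
    return {"Text": text, "VoiceId": voice_id, "OutputFormat": "mp3"}


def translate_and_voice(text, target_language, voice_id, out_path):
    """Translate `text`, synthesise the translation and save it as an mp3 file.

    Needs boto3 and AWS credentials, so boto3 is imported lazily and nothing
    touches the network until this function is called.
    """
    import boto3  # assumption: installed separately, e.g. with pip

    translate = boto3.client("translate")
    polly = boto3.client("polly")

    translated = translate.translate_text(
        **build_translate_request(text, target_language)
    )["TranslatedText"]

    speech = polly.synthesize_speech(**build_polly_request(translated, voice_id))
    with open(out_path, "wb") as mp3_file:
        mp3_file.write(speech["AudioStream"].read())
    return translated
```

A call such as `translate_and_voice("Welcome to the site.", "es", "Lucia", "welcome-es.mp3")` would then produce a Spanish mp3, provided the chosen voice matches the target language.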
But Why Am I Using A Text To Speech Converter?
If you’ve noticed my use of automated voice technology, you may think it’s because I don’t speak English, or that I possess the world’s most unappealing voice.
To prove that my English is not the issue, I’m recording this particular blog post in my natural speaking voice. I guess I’m also proving there is a real person behind this site.
However, all the other audio transcripts on the site will be generated by Amazon Polly.
I’d like to explain why I made the decision to use a text to speech robot in the first place.
In the past I’ve created lots of how-to style videos using screen capture and my own voice. But over time, I began to find the process long-winded and for some reason, for me, extremely stressful.
The Problems Of Creating Screen Capture How To Videos
Sometimes I could do a screen capture video in one take and with no mistakes. But the truth is that this was a fairly rare occurrence. If I was feeling unwell on a particular day, it came through in my voice. If there was noise outside my window, I couldn’t make videos until it went away.
If the phone rang, I had to start again. If the Air Force decided to do practice flights over my house, which they often did, it sounded like the world was ending and I’d have to wait until it was over.
If I detected white noise on the soundtrack, I’d spend ages in Adobe Audition trying to remove the hissing sound. If the settings on my mic changed (which they sometimes did all by themselves) then suddenly my voice sounded too quiet, or maybe too loud.
This meant I’d be back inside Adobe Audition trying to equalise the sound levels.
I worked hard never to have an “um” or an “ah” – or to stumble over my words or to make mistakes in each video I created. This meant lots and lots of re-takes.
Sometimes I’d struggle to find a way to explain something and it would take 15 attempts to get one sentence right.
Worse than that, when you’re live screen-recording and having to re-take sections of it, you often also have to undo the steps you’ve already carried out on the product you’re demonstrating.
This can mean having to delete files, remove items on the screen, clear the browser cache, close windows, or open new accounts to show how to open an account that you’ve already opened. All that messing around takes time and focus away from the job in hand: creating the content.
Then when the video was completed, I’d find I’d missed a point and have to try inserting the information into the now finished video. This meant I became really good at video editing.
Then after years of voicing video content, I discovered Fleeq.
Fleeq Uses Automated Voice Technology
Fleeq is a one-stop shop that let me create a series of screenshots and turn them into something like a video. It then added narration in any language using Amazon Polly, just from the text I typed in English.
What was great about Fleeq was that this was all done in one SaaS product and all automatically. The difference was amazing. I could create how-to videos fast, and concentrate on the content, rather than worry constantly about sound quality or retakes.
But I Soon Found Some Issues
I found out that Fleeq was missing some of the features I’d grown used to and wanted. I contacted the developers to ask if they’d put these in, and my emails went unanswered. After some weeks, I wrote twice more, asking why they avoided contact with users.
I got a response saying they were busy and had already thought of all my suggestions and would implement most of what I suggested in time.
But I needed those things now, so I decided to cobble a solution together manually. These were the issues I had with Fleeq.
- The product didn’t allow the use of SSML (Speech Synthesis Markup Language) in the voice scripts. This means you can’t add intonation to the automatic voice to improve the spoken result.
- There was no control over how the thumbnail appeared when embedding a Fleeq on a webpage. This means you could have spent time making your site look consistent, and then have to use these weird-looking thumbnails with colours you could not control.
I tried using my own thumbnails but came unstuck when users had to click the darned thing twice to get the Fleeq to play. This was due to the Fleeq development team having knobbled the autoplay on their stuff, with no user override available.
- The output was a Fleeq – which is their hosted presentation of your content. But what I wanted was a video file – an mp4. You can export Fleeqs to mp4s but this costs extra or is limited in some way according to the level of package you have.
- The output isn’t 16:9. Fleeqs are produced in a square-ish proportion. This does not fit with today’s standard of wide screen viewing. Nor does it allow whole windows to be screenshot unless you love the black tramline look down each side.
- They didn’t communicate very often.
With the above items fixed, the Fleeq would be amazing. Maybe they will deliver this functionality one day. Maybe they won’t.
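To make the SSML point in the list above concrete, here is a small Python sketch of the kind of markup involved. The helper names (`Markup`, `pause`, `emphasise`, `build_ssml`) are made up for this example. The tags themselves (`speak`, `prosody`, `break`, `emphasis`) are standard SSML, though which tags a given voice honours varies, so treat this as an illustration and not a recipe.

```python
from xml.sax.saxutils import escape


class Markup(str):
    """A string that already contains SSML tags and must not be escaped again."""


def pause(milliseconds):
    """A silent gap of the given length."""
    return Markup(f'<break time="{int(milliseconds)}ms"/>')


def emphasise(text, level="moderate"):
    """Ask the voice to stress a word or phrase."""
    return Markup(f'<emphasis level="{level}">{escape(text)}</emphasis>')


def build_ssml(parts, rate="medium"):
    """Join plain text and markup into one SSML document.

    Plain strings are XML-escaped so characters such as & do not break the markup.
    """
    body = " ".join(
        part if isinstance(part, Markup) else escape(part) for part in parts
    )
    return f'<speak><prosody rate="{rate}">{body}</prosody></speak>'


script = build_ssml(
    ["Click Save & Continue.", pause(400), emphasise("Do not"), "close the window."],
    rate="slow",
)
print(script)
```

With Amazon Polly, a script like this is submitted as SSML instead of plain text (in the API this is the `TextType="ssml"` setting), and the pause lengths and speaking rate are then tweaked by ear.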
What Did I Learn From Using Fleeq?
I learned that they had a great idea, but their implementation sucked. I learned that for some reason, using a text to speech engine took the stress out of the whole video creation process for me.
I think my solution gives much better results than the Fleeqs I made previously. The downside was I lost some of the advantages of Fleeq.
For example, I had to piece each video together myself with various products, making the process slower. Also, I have not yet been able to replicate the multi-language translations of the captions and voice-overs on videos.
But the videos I produce are much more polished, have great thumbnails, and they are actual videos so I can host them on Wistia and make use of all the facilities that Wistia offers.
How I Created The How-To Videos On This Site
1. Make a folder on my drive for the website I’m working on, and then within that, a folder for each how-to video I want to create.
2. Set the browser window where I will be doing the screen capture to a specific size, using the free Chrome browser extension called Window Resizer.
3. Use TechSmith Snagit to take a screen capture, then save the screenshot in the folder I created earlier, giving the screenshot a filename with a number so I know what order they all go in.
4. Load the screenshot into a tool that can create videos from screenshots. You could use PowerPoint, but I prefer to use Powtoon.
5. Type a script for the screenshot into Amazon Polly and adjust using SSML until the speech delivery is good enough. Then download the mp3 sound file produced by Amazon Polly.
6. Upload the mp3 file to the matching screenshot in Powtoon.
7. Use Powtoon to annotate the screenshot to add clarity and interest.
8. Repeat steps 3 to 7 until all screen captures are done.
9. Export to .mp4.
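The numbered filenames in the steps above are what keep each screenshot and its mp3 together. As a rough sketch only, this is one way to generate such names in Python and check that every screenshot has a matching audio file. The naming pattern (a two-digit number, a hyphen, then a short label) and the function names are assumptions for this example, not the exact scheme used on this site.

```python
import re

# Matches names such as "03-open-settings.png" or "03_open_settings.mp3".
NUMBERED = re.compile(r"^(\d+)[-_ ].*\.(png|mp3)$", re.IGNORECASE)


def numbered_name(step, label, extension):
    """Build a sortable filename such as '03-open-settings.png'."""
    slug = re.sub(r"[^a-z0-9]+", "-", label.lower()).strip("-")
    return f"{step:02d}-{slug}.{extension}"


def pair_assets(filenames):
    """Match each numbered screenshot with the mp3 that shares its number.

    Returns (step, screenshot, audio) tuples in step order; audio is None
    when no narration has been downloaded for that screenshot yet.
    """
    screenshots, audio = {}, {}
    for name in filenames:
        match = NUMBERED.match(name)
        if not match:
            continue  # ignore notes, exports and anything else unnumbered
        step = int(match.group(1))
        target = audio if match.group(2).lower() == "mp3" else screenshots
        target[step] = name
    return [(step, screenshots[step], audio.get(step)) for step in sorted(screenshots)]


for step, shot, sound in pair_assets(
    ["02-save.png", "01-login.png", "01-login.mp3", "notes.txt"]
):
    print(step, shot, sound or "MISSING NARRATION")
```

Running a check like this before the final export is a cheap way to spot a screenshot that never had its narration recorded.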
I hope that adequately explains my use of text to speech services on this site.