rise of the robot voice

it's here, and it sucks

The internet lets people express their individuality and creativity, and broadcast it to millions of people on a scale unparalleled in human history. At the same time, oddly enough, the internet has a devastating homogenizing effect that makes society boring, less dynamic, and less interesting.

This can largely be attributed to black-box recommendation algorithms and centralized social-media platforms. Instagram has billions of users, and yet if I say “Instagram aesthetic” you know what I mean. Same with “YouTube face.” Or “vlogger voice” (“hey guys!”). Or “Facebook headlines.” Recommendation algorithms create a right way to express one’s self, and countless wrong ways.

TikTok is currently the most concerning platform in this regard because of how easy the app makes it to repurpose other people’s audio and video. Sounds are recycled, and videos are stitched together or duetted. I’ve seen so many bedrooms with LED light strips that I (old man) regularly ask myself, “When the hell did LED light strips become a thing? Did we run out of normal bulbs? Why does Gen Z need 24/7 access to programmable mood lighting?” If you look at an apartment building and the light emanating from it is anything other than fluorescent or a soft yellow, that room’s occupant definitely knows George Lopez from Nick At Nite rather than ABC. Maybe it’s just the corner of TikTok my browsing habits have cultivated, but I doubt it.

A TikTok trend that amused me earlier this year worked like this: dog owners would write stories in their pet’s voice, and then have Siri convert that to speech, so it was as if the pet was narrating it in stiff robot voice.

At some point in the last week or two, TikTok rolled out an easy-to-use text-to-speech feature. It already seems to be in widespread use. Like any other megaplatform, TikTok has the ability to make technologies, behaviors, and presentation styles standard just by putting them in front of enough people. If enough other users are making their videos in a certain way, you might feel compelled to as well. I didn’t have to look hard for the text-to-speech feature in the editor; the app suggested that I try it out as soon as I wrote a test caption (“pee pee poo poo”). With a couple of taps, the text captions overlaid on a video are read aloud by a robotic woman’s voice in sync with the video. This worries me immensely. Here are some examples I came across: 1, 2, 3.

(I’ll pause here to note that, duh-doyee, text-to-speech has been around forever and has tons of valid uses. It is essential technology for computer users with visual impairments who cannot see text on a display, and for people who have lost the ability to speak.)

First, I guess it’s worth asking why — aside from the aforementioned accessibility concerns — this feature is even being promoted. I’d theorize (and to be clear, this is just a theory but if you work at TikTok, feel free to email me) that videos which require the viewer to read retain viewers at a lower rate than videos that viewers can just listen to. For a long time, and maybe still, media companies on Facebook produced what amounted to elaborate PowerPoint slides — news footage overlaid with text captions and soundtracked by stock music which could be digested on mute. No need to hire a voiceover artist and spend half a day recording narration.

The next step in this format is probably whatever TikTok is going for here, translating that video form from active to passive. Occam’s Razor would say that this text-to-speech tool is an engagement-boosting feature for more than users with visual impairments. (Also, sidenote, if you are that type of user and love TikTok, I would like to know more about that, so you should also feel free to email me.)

What’s interesting is that, despite the obvious justification of accessibility, the implementation suggests something else. If a user turns on text-to-speech, the text plays for all viewers, regardless of whether or not they use a screen reader. The text-to-speech is not an additional, optional layer; it is baked in for everyone. This makes text-to-speech less an accessibility feature and more of an aesthetic, creative choice made by the content creator: Do they want to record voiceover themselves, or just have a computer do it?

The main reason this feature sets off some alarm bells in my head is that, like, man, if you thought every aspiring influencer and content creator sounded the same before, imagine what happens when they actually sound the same. The potential cascading effects of this are just mortifying when I play them out in my head. The mainstreaming of completely stilted robot voice with odd, lilting emotional tinges and quirky mispronunciations, making content creation (I absolutely despise that term) even more brainless and frictionless.

It does not bode well for the future. And guess what: the future is now. Because I now have the immense displeasure of informing you of Reddit channels on YouTube. These channels, as the language implies, combine two of my least favorite websites: Reddit, where memes go to die, and YouTube, which literally creates terrorists.

Reddit channels aggregate popular threads, convert them to audio with text-to-speech software, and blast them out on YouTube for hundreds of thousands of views apiece.

I guess in some ways it’s fun to see content from Reddit, the clearinghouse for bootleg viral content, get its just deserts, but the popularity of these channels is devastating and depressing. To each their own, but also: seriously? This? You can pirate almost anything via the internet, and yet this robot voice stuff has an audience? I’m close to swerving into crotchety old man territory again, but jeeeeeesus christ.

It’s not difficult to find tutorials online for the almost total automation of the production of these videos, nor is it difficult to find YouTube videos explaining “How To Start Youtube CASH COW CHANNEL (TEXT TO SPEECH TUTORIAL) (2021 Youtube Cash Cow Tutorial)” and “BEST text to speech Voice Over solution for YouTubers || 2020” and “5 Best Text To Speech Software For YouTube Videos (#1 Real Human Voice) 2020/2021.”

I don’t really know what can or should be done about any of this; it just makes me worried for the future of the mainstream internet. When platforms like TikTok endorse and promote this kind of content, they’re also ramming it into acceptability. Centralized platforms have already eradicated so much individuality from the internet that I shudder to think what they’ll do when multimedia creation doesn’t even require a human touch. I’m probably being paranoid.

An earlier version of this piece contained some clumsy language regarding accessibility. The writer responsible has been sacked and the mistakes have been corrected.


Thank you for reading BNet. Sorry this is late, but technically it still came out on Tuesday.