(Im)mature tech

There are several AI use cases I'm looking forward to seeing mature into mainstream technologies. Speech-to-text is one of them.

Talking to devices is the most unsocial, psychopathic behavior of the last century. It has been depicted as the epitome of the future in movies and books alike. Watching someone walk down the street talking to no one is puzzling. But what's even more astonishing is how bad transcription engines are.

You would think AI solves this, but some well-known applications still struggle.

  • Speech-to-text in a coding application doesn't understand technical terms or stack names, so you end up with a garbled prompt. Instead of speeding you up, it slows you down: it forces you to rethink and rewrite your prompt, which of course you have to do manually.1
  • Speech-to-text in one of the mainstream AI applications I use can't tell the difference between Spanish and Portuguese, so when I (rarely) speak Spanish, it transcribes my words into perfect Portuguese. If I hit send, I (obviously?) get the answer back in Portuguese, which I don't understand quite as well as Spanish. If I try to work around this by selecting Spanish as the preferred language, then when I chat in English (which is most of the time) the LLM replies in Spanish (WTF?).

As I said, I'm really looking forward to a world where speech-to-text is 99% accurate and simply taken for granted: a world in which you can speak to the machine and it correctly transcribes what you say out loud. The present is very discouraging.

Note that I didn't even mention Siri, which still seems to be running the first version Apple ever released. It's a miracle that Apple has kept such a terrible product on the market for years, yet continues to nearly dominate the cell phone market.

Footnotes

  1. The only application I've found that excels at speech-to-text is whisprflow.ai, which also shines with low-volume speech. Furthermore, it correctly identifies words across mixed languages and contexts.