If you use Google TTS, you may find that the TTS script is very slow to download audio files.
One thing you can do is break a longer message into several parts, so that TTS only has to download the small part that actually changes.
For example, to synthesize "You have 36 apples in your basket", you could synthesize "You have", "36 apples" and "in your basket" as separate TTS calls. Because synthesized content is cached, the fixed parts are downloaded only once, and on later calls only the part containing the number has to be downloaded.
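The effect of splitting a message can be illustrated with a small simulation. This is a minimal sketch, not the TTS script itself: the `CachingTTS` class and the `fake_fetch` function are hypothetical stand-ins, assuming only that audio is cached per exact text fragment and that fetching an uncached fragment is the slow step. It records which fragments would actually be downloaded.

```python
from typing import Callable, Dict, List


class CachingTTS:
    """Hypothetical TTS front end that caches audio per text fragment."""

    def __init__(self, fetch: Callable[[str], bytes]) -> None:
        self._fetch = fetch                 # stand-in for the slow download
        self._cache: Dict[str, bytes] = {}  # text fragment -> audio
        self.downloads: List[str] = []      # fragments that were actually fetched

    def synthesize(self, text: str) -> bytes:
        # Only text that has not been seen before triggers a download.
        if text not in self._cache:
            self.downloads.append(text)
            self._cache[text] = self._fetch(text)
        return self._cache[text]

    def say_parts(self, parts: List[str]) -> List[bytes]:
        # One TTS call per fragment; the clips are returned in playback order.
        return [self.synthesize(p) for p in parts]


def fake_fetch(text: str) -> bytes:
    # Hypothetical placeholder for the network request to the TTS service.
    return text.encode("utf-8")


tts = CachingTTS(fake_fetch)
tts.say_parts(["You have", "36 apples", "in your basket"])
tts.say_parts(["You have", "12 apples", "in your basket"])
print(tts.downloads)
# → ['You have', '36 apples', 'in your basket', '12 apples']
```

On the second message only "12 apples" is fetched, because "You have" and "in your basket" are already cached. Had the full sentence been synthesized as one call, every different number would have produced a complete new download.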