A server in a network gathers textual information, such as news items, E-mail and the like. From that information, the server develops or identifies messages for use by individual subscribers. The same server that accumulates the text messages or another server in the network converts the textual information in each message to a sequence of speech synthesizer instructions. The converted messages, containing the sequences of speech synthesizer instructions, are transmitted to each identified subscriber's terminal device. A synthesizer in the terminal generates an audio waveform signal, representing the speech information, in response to the instructions. In the preferred embodiment, the terminals utilize concatenative type speech synthesizers, each of which has an associated vocabulary of stored fundamental sound samples. The instructions identify the sound samples, in order. The instructions also provide parameters for controlling characteristics of the signal generated during waveform synthesis for each sound sample in each sequence. For example, the instructions may specify the pitch, duration, amplitude, attack envelope and decay envelope for each sample. The division of the text to speech synthesis processing between the server and the terminals places the cost of the front end processing in the server, which is a shared resource. As a result, the hardware and software of the terminal may be relatively simple and inexpensive. Also, it is possible to upgrade the quality of the synthesis by upgrading the server software, without modifying the terminals.