The hardware aspects concern the complete integrated system delivered to the customer. We distinguish several components, each of which may be supplied by the technology provider or by the application developer. This chapter decomposes such a system into its basic hardware components: platforms, speech processing boards, speech input/output interfaces, etc.
The platform is the "black box" that will be installed at the customer site. It can be a PC (or compatible) or a proprietary system.
The major requirements concern its capabilities in terms of CPU (386, 486, Pentium, Power-PC, Motorola), memory (RAM, hard disk), data transfer rate through the PC bus (ISA, EISA, PCI) or through dedicated buses, data transfer rate from the memory cache to the disk (when writing files), and the current (in amperes) available to power the expansion boards. Dedicated boards with DSPs are installed in the free slots of the platform. These may be half-size or full-size slots of the backplane.
The application developer has to know which hardware configuration will meet his needs and then state the requirements as above. He also has to know whether more than one platform can be used, connected through a LAN.
For speech processing, specific boards (a dedicated board or an off-the-shelf board from Dialogic, Rhetorex, LSI, NMS or other vendors) or the local CPU may be used. The application developer has to know how to install and configure the boards. He also has to know what capabilities the board offers with respect to his application. For example, if an application has to recognise 10 words, the developer may be able to use a single speech processing board to process two calls simultaneously. He therefore has to know the number of simultaneous sessions or calls that can be handled in real time (that is, how many telephone lines, if the system is telephone-network oriented). In some applications this number depends on the language, which has to be taken into account (a TTS board may handle 3 calls for Spanish synthesis but only one for French).
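The capacity reasoning above can be sketched as a small calculation. The following Python fragment is only an illustration: the function name `boards_needed` and the table `CALLS_PER_BOARD` are hypothetical, and the per-board figures are simply those quoted in the TTS example (3 calls for Spanish, 1 for French), not vendor specifications.

```python
import math

# Hypothetical capacities: simultaneous calls one TTS board can handle,
# using the illustrative figures from the text (not vendor data).
CALLS_PER_BOARD = {"spanish": 3, "french": 1}


def boards_needed(lines: int, language: str) -> int:
    """Return the number of boards required to serve `lines` simultaneous calls."""
    if lines < 0:
        raise ValueError("lines must be non-negative")
    per_board = CALLS_PER_BOARD[language]
    # Round up: a partially used board still occupies a slot.
    return math.ceil(lines / per_board)
```

Under these assumptions, a 12-line system would need 4 boards for Spanish but 12 for French, which shows why the language has to be known before the platform is sized.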
The technology provider may also offer different boards with multi-channel configurations (for example, Board A with 2 recognisers and Board B with 4 recognisers). The application developer has to know whether he can plug in either of those and still run his application.
There are also hardware constraints to consider, such as the number of free slots required and the power and memory requirements of the boards.
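A check of this kind can be sketched in a few lines. The fragment below is a minimal illustration, assuming that each board is described only by the slots it occupies and the current it draws. The class names, field names and the `fits` function are hypothetical, and the figures in the usage note are invented for the example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Board:
    name: str
    slots: int            # backplane slots occupied by the board
    current_amps: float   # current drawn from the platform power supply


@dataclass
class Platform:
    free_slots: int             # slots still available on the backplane
    spare_current_amps: float   # current still available for expansion boards


def fits(platform: Platform, boards: List[Board]) -> bool:
    """Return True if all boards fit within the free slots and spare current."""
    total_slots = sum(b.slots for b in boards)
    total_current = sum(b.current_amps for b in boards)
    return (total_slots <= platform.free_slots
            and total_current <= platform.spare_current_amps)
```

For instance, a platform with 4 free slots and 6.0 A to spare would accept a telephone interface drawing 1.5 A plus one recogniser board drawing 2.0 A, but not the same interface plus three such recogniser boards (7.5 A in total), even though the slots would suffice.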
If the application is used within a desktop application, the speech input/output may use a sound board with an integrated microphone. If it is used within a telecommunication application, an interface to the PTT network is needed. Such interfaces are available from many vendors. As with the speech processing boards, the application developer has to know the requirements of his input/output interfaces.
In many configurations one needs at least two boards: one to deal with telephone signalling (the telephone interface) and a second one that implements the speech processing. The two boards use a particular bus to exchange speech data. The objective of such a bus is to allow interaction between different boards implementing different applications from different technology providers on the same platform in an open environment. These buses are hardware and software implementations. The best known ones are:
The availability of such connections in the technology provided (hardware as well as APIs) makes the application easy to port, provided that portability has been anticipated.
As mentioned above, the application developer has to know how to manage his CPU load when using a multi-channel system and should require a uniform and coherent response time on each channel. The technology provider should guarantee a maximum response time under worst-case conditions. The real-time aspect relates to the complete application and should be estimated with all lines active. For example, if the system plays a beep before starting speech recognition, the application developer has to compute the delay: the beep prompt plus the time needed to start recognition. This delay is crucial because people may speak before the beep, which leads to a gap error.
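The delay computation can be illustrated with a short sketch. It assumes, as stated above, that the delay is the beep duration plus the time the recogniser needs to start, and that start-up times have been measured per channel with all lines active. The function names and the figures in the usage note are hypothetical.

```python
from typing import Sequence


def worst_case_delay_ms(beep_ms: float, start_times_ms: Sequence[float]) -> float:
    """Beep duration plus the slowest recogniser start-up time over all channels."""
    if not start_times_ms:
        raise ValueError("at least one channel measurement is required")
    return beep_ms + max(start_times_ms)


def channel_spread_ms(start_times_ms: Sequence[float]) -> float:
    """Difference between the slowest and fastest channel (0 means uniform)."""
    if not start_times_ms:
        raise ValueError("at least one channel measurement is required")
    return max(start_times_ms) - min(start_times_ms)
```

With a 200 ms beep and measured start-up times of 40, 55, 120 and 60 ms on four lines, the worst-case delay is 320 ms and the spread between channels is 80 ms. The first figure can be compared with the maximum response time guaranteed by the technology provider, and the second with the requirement for a uniform response time on each channel.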