Document Translation - OpenAI Real-Time Speech API Beta Version
3 October 2024 · 9 min read
The rise of artificial intelligence has introduced groundbreaking advancements in language processing and speech recognition. One of the latest innovations from OpenAI is the Real-Time Speech API in its Beta version, specifically designed for tasks like document translation. This API represents a leap forward in how we interact with technology, making it easier to transcribe, translate, and interpret spoken language in real-time.
What Is the OpenAI Real-Time Speech API?
The OpenAI Real-Time Speech API is a tool that enables developers and businesses to integrate AI-driven speech-to-text and translation features into their applications. The API is designed to process audio input, convert it into text, and perform translations in real-time across multiple languages. While still in the Beta phase, this API is already showing great potential in handling complex linguistic tasks such as multilingual transcription, live document translation, and audio-to-text conversion.
Original docs: https://platform.openai.com/docs/guides/realtime
Key Features of the OpenAI Real-Time Speech API
- Real-Time Speech Recognition: The API is capable of converting spoken language into written text almost instantly, allowing for seamless speech-to-text functionality.
- Multilingual Support: It supports a wide range of languages and dialects, making it a powerful tool for translation across global audiences.
- Real-Time Translation: Not only can the API transcribe speech, but it can also translate it into different languages in real-time. This is particularly useful for live document translation or communication in multilingual settings.
- Customizable Voice Inputs: Developers can optimize the API to recognize different accents, languages, and even specific terminologies, making it highly adaptable to various industries such as healthcare, legal, or customer service.
- Scalability: The API can process large volumes of data efficiently, making it ideal for enterprises needing to translate or transcribe long speeches, meetings, or entire documents.
How Does the OpenAI Real-Time Speech API Work?
The OpenAI Real-Time Speech API uses advanced deep learning models, particularly transformer architectures and language models, to process spoken language. Here’s a simplified breakdown of how the process works for document translation, followed by a short code sketch of the same flow:
- Audio Input: The API takes an audio file or a real-time voice stream as input.
- Speech Recognition: Using AI models, the speech is converted into written text. This involves understanding phonetics, grammar, and contextual nuances in the spoken language.
- Translation: The transcribed text is then translated into the target language using a neural machine translation model.
- Output: The final output is delivered in the form of translated text, ready to be integrated into documents, subtitles, or any other written form.
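To make the four steps above concrete, here is a minimal sketch of how they map onto real-time API events. It assumes an already-open WebSocket connection (ws) like the one created in the Quick Start below, plus the base64EncodeAudio() helper shown later in this post; the myFloat32AudioChunk variable, the target language, and the wording of the instructions are purely illustrative.

// Assumes `ws` is an open real-time API WebSocket (see Quick Start below)
// and `base64EncodeAudio()` is the PCM16/base64 helper shown later in this post.
// The target language and instruction wording are only an example.

// 1. + 2. Audio input and speech recognition: append base64-encoded PCM16 audio
ws.send(JSON.stringify({
  type: 'input_audio_buffer.append',
  audio: base64EncodeAudio(myFloat32AudioChunk) // hypothetical audio chunk
}));
ws.send(JSON.stringify({ type: 'input_audio_buffer.commit' }));

// 3. Translation: ask the model to respond with a translation of the speech
ws.send(JSON.stringify({
  type: 'response.create',
  response: {
    modalities: ['text'],
    instructions: 'Transcribe the user audio and translate it into French.'
  }
}));

// 4. Output: collect the translated text from the server's response events
ws.on('message', data => {
  const event = JSON.parse(data);
  if (event.type === 'response.text.delta') process.stdout.write(event.delta);
});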
Quick Start
The real-time API is a WebSocket interface designed to run on a server. To help you get started quickly, we have created a console demo that showcases some of the API's functions. Although we don't recommend using this front-end pattern in a production environment, the app will help you visualize and inspect the event flow of the real-time API.
To get started quickly, download and configure this demo - https://github.com/openai/openai-realtime-console.
Overview
The real-time API is a stateful, event-based API that communicates via WebSockets. WebSocket connections require the following parameters:
- URL: wss://api.openai.com/v1/realtime
- Query parameters: ?model=gpt-4o-realtime-preview-2024-10-01
- Request headers: Authorization: Bearer YOUR_API_KEY and OpenAI-Beta: realtime=v1
Here is a simple example of using the popular ws library to establish a socket connection, send a message from the client, and receive a response from the server in Node.js. It requires the OPENAI_API_KEY environment variable to be set.
import WebSocket from "ws";

const url = "wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview-2024-10-01";
const ws = new WebSocket(url, {
  headers: {
    "Authorization": "Bearer " + process.env.OPENAI_API_KEY,
    "OpenAI-Beta": "realtime=v1",
  },
});

ws.on("open", function open() {
  console.log("Connected to the server.");
  ws.send(JSON.stringify({
    type: "response.create",
    response: {
      modalities: ["text"],
      instructions: "Please assist the user.",
    }
  }));
});

ws.on("message", function incoming(message) {
  console.log(JSON.parse(message.toString()));
});
A complete list of events emitted by the server, and events that the client can send, can be found in the API reference. Once connected, you will send and receive events representing text, audio, function calls, interrupts, configuration updates, and more.
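As one example of a configuration update, the sketch below adjusts the session over an existing connection using the session.update client event. The field names follow the API reference, but treat the specific values (voice, audio formats, turn detection, translation instructions) as illustrative assumptions rather than required settings.

// Example configuration update over an existing connection.
// Field names follow the real-time API reference; values here are illustrative.
ws.send(JSON.stringify({
  type: 'session.update',
  session: {
    modalities: ['text', 'audio'],
    voice: 'alloy',
    input_audio_format: 'pcm16',
    output_audio_format: 'pcm16',
    turn_detection: { type: 'server_vad' },
    instructions: 'Translate everything the user says into Spanish.'
  }
}));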
Examples
Here are some common examples of API functionality to get you started. These examples assume you have already instantiated a WebSocket.
Send user text
const event = {
  type: 'conversation.item.create',
  item: {
    type: 'message',
    role: 'user',
    content: [
      {
        type: 'input_text',
        text: 'Hello!'
      }
    ]
  }
};
ws.send(JSON.stringify(event));
ws.send(JSON.stringify({type: 'response.create'}));
Send user audio
import fs from 'fs';
import decodeAudio from 'audio-decode';

// Converts Float32Array of audio data to PCM16 ArrayBuffer
function floatTo16BitPCM(float32Array) {
  const buffer = new ArrayBuffer(float32Array.length * 2);
  const view = new DataView(buffer);
  let offset = 0;
  for (let i = 0; i < float32Array.length; i++, offset += 2) {
    let s = Math.max(-1, Math.min(1, float32Array[i]));
    view.setInt16(offset, s < 0 ? s * 0x8000 : s * 0x7fff, true);
  }
  return buffer;
}

// Converts a Float32Array to base64-encoded PCM16 data
function base64EncodeAudio(float32Array) {
  const arrayBuffer = floatTo16BitPCM(float32Array);
  let binary = '';
  let bytes = new Uint8Array(arrayBuffer);
  const chunkSize = 0x8000; // 32KB chunk size
  for (let i = 0; i < bytes.length; i += chunkSize) {
    let chunk = bytes.subarray(i, i + chunkSize);
    binary += String.fromCharCode.apply(null, chunk);
  }
  return btoa(binary);
}

// Using the "audio-decode" library to get raw audio bytes
const myAudio = fs.readFileSync('./path/to/audio.wav');
const audioBuffer = await decodeAudio(myAudio);
const channelData = audioBuffer.getChannelData(0); // only accepts mono
const base64AudioData = base64EncodeAudio(channelData);

const event = {
  type: 'conversation.item.create',
  item: {
    type: 'message',
    role: 'user',
    content: [
      {
        type: 'input_audio',
        audio: base64AudioData
      }
    ]
  }
};
ws.send(JSON.stringify(event));
ws.send(JSON.stringify({type: 'response.create'}));
Stream user audio
import fs from 'fs';
import decodeAudio from 'audio-decode';

// Convert Float32Array audio data to PCM16 ArrayBuffer
function floatTo16BitPCM(float32Array) {
  const buffer = new ArrayBuffer(float32Array.length * 2);
  const view = new DataView(buffer);
  let offset = 0;
  for (let i = 0; i < float32Array.length; i++, offset += 2) {
    let s = Math.max(-1, Math.min(1, float32Array[i]));
    view.setInt16(offset, s < 0 ? s * 0x8000 : s * 0x7fff, true);
  }
  return buffer;
}

// Convert Float32Array to base64-encoded PCM16 data
function base64EncodeAudio(float32Array) {
  const arrayBuffer = floatTo16BitPCM(float32Array);
  let binary = '';
  let bytes = new Uint8Array(arrayBuffer);
  const chunkSize = 0x8000; // 32KB chunk size
  for (let i = 0; i < bytes.length; i += chunkSize) {
    let chunk = bytes.subarray(i, i + chunkSize);
    binary += String.fromCharCode.apply(null, chunk);
  }
  return btoa(binary);
}

// Fill the audio buffer with the contents of three files,
// then request the model to generate a response.
const files = [
  './path/to/sample1.wav',
  './path/to/sample2.wav',
  './path/to/sample3.wav'
];

for (const filename of files) {
  const audioFile = fs.readFileSync(filename);
  const audioBuffer = await decodeAudio(audioFile);
  const channelData = audioBuffer.getChannelData(0);
  const base64Chunk = base64EncodeAudio(channelData);
  ws.send(JSON.stringify({
    type: 'input_audio_buffer.append',
    audio: base64Chunk
  }));
}

ws.send(JSON.stringify({type: 'input_audio_buffer.commit'}));
ws.send(JSON.stringify({type: 'response.create'}));
Summary:
- PCM Conversion: The floatTo16BitPCM() function converts audio data from Float32Array to PCM16 format.
- Base64 Encoding: The base64EncodeAudio() function converts the PCM16 data into a base64-encoded string.
- Audio Buffer and WebSocket: For each audio file, the code reads the file, decodes the audio, converts it to base64, and sends it over the WebSocket connection (ws.send()). After all files are processed, it commits the audio buffer and requests the model to generate a response.
Session
A session refers to a single WebSocket connection between the client and the server.
After the client creates a session, it sends JSON-formatted events containing text and audio chunks. The server will respond with audio containing the speech output, a text transcription of that speech output, and a function call (if the client provided a function).
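A minimal handler for those server responses might look like the sketch below. It assumes the event names documented in the API reference (response.audio.delta for base64 speech output, response.audio_transcript.delta for the transcription, response.done when the response is finished) and simply buffers the audio chunks to a file; the output path is arbitrary.

import fs from 'fs';

// Minimal sketch of consuming a server response on an existing session.
// Event names are taken from the API reference; error handling is omitted.
const audioChunks = [];

ws.on('message', data => {
  const event = JSON.parse(data);
  switch (event.type) {
    case 'response.audio.delta':
      // Base64-encoded PCM16 audio for the speech output
      audioChunks.push(Buffer.from(event.delta, 'base64'));
      break;
    case 'response.audio_transcript.delta':
      // Incremental text transcription of that speech output
      process.stdout.write(event.delta);
      break;
    case 'response.done':
      // The full audio is now buffered; write or play it as needed
      fs.writeFileSync('./output.pcm', Buffer.concat(audioChunks));
      break;
  }
});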
Example session object:
{
  id: "sess_001",
  object: "realtime.session",
  ...
  model: "gpt-4o",
  voice: "alloy",
  ...
}
Integration Guide
The real-time API currently supports two audio formats: 24kHz single-channel, little-endian raw 16-bit PCM, and 8kHz G.711 (both u-law and a-law). Additional audio codecs will be supported soon.
Audio data should be provided as base64-encoded chunks of audio frames.
The following Python code example uses the pydub library to create a valid audio message from the raw bytes of an audio file (assuming the file contains header information). For Node.js, the audio-decode library offers tools to process raw audio tracks from various file formats.
import io
import json
import base64
from pydub import AudioSegment

def audio_to_item_create_event(audio_bytes: bytes) -> str:
    # Load the audio file from the byte stream
    audio = AudioSegment.from_file(io.BytesIO(audio_bytes))

    # Resample to 24kHz mono PCM16
    pcm_audio = audio.set_frame_rate(24000).set_channels(1).set_sample_width(2).raw_data

    # Encode as a base64 string
    pcm_base64 = base64.b64encode(pcm_audio).decode()

    event = {
        "type": "conversation.item.create",
        "item": {
            "type": "message",
            "role": "user",
            "content": [{
                "type": "input_audio",
                "audio": pcm_base64
            }]
        }
    }
    return json.dumps(event)
Sending an Event
To send an event to the API, you need to transmit a JSON string containing the event payload. Ensure that you are properly connected to the API before initiating the event.
// Ensure we are connected
ws.on('open', () => {
  // Send an event
  const event = {
    type: 'conversation.item.create',
    item: {
      type: 'message',
      role: 'user',
      content: [
        {
          type: 'input_text',
          text: 'Hello!'
        }
      ]
    }
  };
  ws.send(JSON.stringify(event));
});
Receive Events
ws.on('message', data => {
  try {
    const event = JSON.parse(data);
    console.log(event);
  } catch (e) {
    console.error(e);
  }
});
Handling Errors
All errors are transmitted from the server to the client through events, specifically via the "error" event. These errors typically arise when client events are invalid. You can handle them as follows:
const errorHandler = (error) => {
  console.log('type', error.type);
  console.log('code', error.code);
  console.log('message', error.message);
  console.log('param', error.param);
  console.log('event_id', error.event_id);
};

ws.on('message', data => {
  try {
    const event = JSON.parse(data);
    if (event.type === 'error') {
      const { error } = event;
      errorHandler(error);
    }
  } catch (e) {
    console.error(e);
  }
});
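Because the error payload carries the event_id of the client event that caused it, a common pattern is to attach your own event_id to everything you send so failures can be traced back to their source. The sketch below assumes client events accept an optional event_id field, as described in the API reference; the id format and the pending map are just one possible bookkeeping scheme.

// Tag outgoing events with an event_id so server "error" events can be correlated.
// The id format here is arbitrary; any unique string should work.
let counter = 0;
const pending = new Map();

function sendTracked(event) {
  const event_id = `evt_${++counter}`;
  pending.set(event_id, event.type);
  ws.send(JSON.stringify({ event_id, ...event }));
}

sendTracked({ type: 'response.create' });

ws.on('message', data => {
  const event = JSON.parse(data);
  if (event.type === 'error' && pending.has(event.error.event_id)) {
    console.error(`"${pending.get(event.error.event_id)}" failed:`, event.error.message);
  }
});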
There are 9 client event types you can send and 28 server event types you can listen for. For complete details, refer to the API specification on the reference page.
The Future of Real-Time Translation with OpenAI
The future of the OpenAI Real-Time Speech API is bright, with the potential to transform industries that rely heavily on multilingual communication. As AI technology continues to advance, we can expect improvements in accuracy, speed, and the range of supported languages.
Some exciting future developments may include:
- Improved Natural Language Understanding: Enhancements in understanding and processing different accents, dialects, and languages with more nuanced context.
- Greater Language Support: Expansion to support even more languages and regional dialects.
- Enhanced Customization: More options for tailoring the API to specific industries, use cases, and organizational needs.
Conclusion
The OpenAI Real-Time Speech API Beta Version is a powerful tool that introduces new possibilities for document translation, transcription, and speech processing. Whether it’s facilitating live translations at international events or streamlining internal workflows for global corporations, the API offers a fast, efficient, and cost-effective solution to real-time language processing.