Transcribe speech to text in real time using Amazon Transcribe with WebSocket
Amazon Transcribe is an automatic speech recognition (ASR) service that makes it easy for developers to add speech-to-text capability to applications. In November 2018, we added streaming transcriptions over HTTP/2 to Amazon Transcribe. This enabled users to pass a live audio stream to our service and, in return, receive text transcripts in real time. We are excited to share that we recently started supporting real-time transcriptions over the WebSocket protocol. WebSocket support makes streaming speech-to-text through Amazon Transcribe more accessible to a wider user base, especially for those who want to build browser or mobile-based applications.
In this blog post, we assume that you are familiar with our streaming transcription service running over HTTP/2, and focus on showing you how to use the real-time offering over WebSocket. For reference on using HTTP/2, you can read our previous blog post and tech documentation.
What is WebSocket?
WebSocket is a full-duplex communication protocol built over TCP. The protocol was standardized by the IETF as RFC 6455 in 2011. WebSocket is suitable for long-lived connections in which both the server and the client can transmit data over the same connection at the same time. It is also practical for cross-domain usage. Voila! You don't need to worry about cross-origin resource sharing (CORS) as you would when making cross-domain requests over HTTP.
Using Amazon Transcribe streaming with WebSocket
To use Amazon Transcribe’s StartStreamTranscriptionWebSocket API, you first need to authorize your IAM user to use Amazon Transcribe streaming over WebSocket. In the AWS Management Console, navigate to Identity & Access Management (IAM) and attach an inline policy to your user that allows access to the StartStreamTranscriptionWebSocket API. Please refer to “To embed an inline policy for a user or role” for instructions on how to add permissions.
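As a minimal sketch of such an inline policy, the Python snippet below builds the policy document and prints it as JSON that you could paste into the IAM console. It assumes the IAM action is named after the API, that is, transcribe:StartStreamTranscriptionWebSocket; the statement ID is a hypothetical label, and you may want to scope the policy more tightly for your own account.

```python
import json

# Hypothetical statement ID; any short alphanumeric label works here.
POLICY_SID = "transcribestreaming"


def build_inline_policy() -> dict:
    """Return an IAM policy document that allows WebSocket streaming."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": POLICY_SID,
                "Effect": "Allow",
                # Assumed action name, derived from the API name in the text.
                "Action": "transcribe:StartStreamTranscriptionWebSocket",
                "Resource": "*",
            }
        ],
    }


if __name__ == "__main__":
    # Print the JSON document to paste into the inline policy editor.
    print(json.dumps(build_inline_policy(), indent=4))
```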
Your upgrade request should be pre-signed with your AWS credentials using AWS Signature Version 4. The request should contain the required parameters, namely sample-rate, language-code, and media-encoding. You can optionally supply vocabulary-name to use a custom vocabulary. The StartStreamTranscriptionWebSocket API supports all of the languages that Amazon Transcribe streaming supports today. After your connection is upgraded to WebSocket, you can send each audio chunk as an AudioEvent in event-stream encoding, carried in a binary WebSocket frame. The response you get is the transcript JSON, which is also event-stream encoded. For more details, please refer to our tech docs.
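The two client-side steps described above can be sketched in Python with only the standard library. This is an illustration under stated assumptions, not a reference implementation: the endpoint host, port 8443, the path /stream-transcription-websocket, the signing service name transcribe, and the three event-stream headers are assumptions to verify against the tech docs, and presign_url and encode_audio_event are hypothetical helper names.

```python
import datetime
import hashlib
import hmac
import struct
import urllib.parse
import zlib


def _hmac(key: bytes, msg: str) -> bytes:
    return hmac.new(key, msg.encode("utf-8"), hashlib.sha256).digest()


def presign_url(access_key, secret_key, region, language_code="en-US",
                media_encoding="pcm", sample_rate=16000, expires=300, now=None):
    """Build a Signature Version 4 pre-signed URL for the upgrade request."""
    now = now or datetime.datetime.now(datetime.timezone.utc)
    amz_date = now.strftime("%Y%m%dT%H%M%SZ")
    date_stamp = now.strftime("%Y%m%d")
    # Assumed endpoint, port, and path; confirm them in the tech docs.
    host = f"transcribestreaming.{region}.amazonaws.com:8443"
    path = "/stream-transcription-websocket"
    scope = f"{date_stamp}/{region}/transcribe/aws4_request"
    params = {
        "X-Amz-Algorithm": "AWS4-HMAC-SHA256",
        "X-Amz-Credential": f"{access_key}/{scope}",
        "X-Amz-Date": amz_date,
        "X-Amz-Expires": str(expires),
        "X-Amz-SignedHeaders": "host",
        "language-code": language_code,
        "media-encoding": media_encoding,
        "sample-rate": str(sample_rate),
    }
    # Canonical query string: URL-encoded pairs sorted by parameter name.
    query = "&".join(
        f"{urllib.parse.quote(k, safe='')}={urllib.parse.quote(v, safe='')}"
        for k, v in sorted(params.items())
    )
    # Only the host header is signed; the payload is the empty string.
    canonical_request = "\n".join(
        ["GET", path, query, f"host:{host}", "", "host",
         hashlib.sha256(b"").hexdigest()]
    )
    string_to_sign = "\n".join(
        ["AWS4-HMAC-SHA256", amz_date, scope,
         hashlib.sha256(canonical_request.encode("utf-8")).hexdigest()]
    )
    # Derive the signing key: date, then region, service, and terminator.
    key = ("AWS4" + secret_key).encode("utf-8")
    for part in (date_stamp, region, "transcribe", "aws4_request"):
        key = _hmac(key, part)
    signature = hmac.new(key, string_to_sign.encode("utf-8"),
                         hashlib.sha256).hexdigest()
    return f"wss://{host}{path}?{query}&X-Amz-Signature={signature}"


def _header(name: str, value: str) -> bytes:
    """Encode one string-typed event-stream header (value type 7)."""
    n, v = name.encode("utf-8"), value.encode("utf-8")
    return struct.pack("!B", len(n)) + n + struct.pack("!BH", 7, len(v)) + v


def encode_audio_event(chunk: bytes) -> bytes:
    """Wrap raw audio bytes in an event-stream message for a binary frame."""
    headers = (
        _header(":content-type", "application/octet-stream")
        + _header(":event-type", "AudioEvent")
        + _header(":message-type", "event")
    )
    # Layout: 8-byte prelude, prelude CRC, headers, payload, message CRC.
    total_length = 8 + 4 + len(headers) + len(chunk) + 4
    prelude = struct.pack("!II", total_length, len(headers))
    message = prelude + struct.pack("!I", zlib.crc32(prelude)) + headers + chunk
    return message + struct.pack("!I", zlib.crc32(message))
```

With a WebSocket client of your choice, you would open the URL returned by presign_url and send each result of encode_audio_event as a binary frame. To use a custom vocabulary, vocabulary-name could be added to the params dictionary before signing. If you sign with temporary credentials, the session token likely needs to be included as an additional query parameter; check the tech docs for the exact requirements.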
About the authors
Bhaskar Bagchi is an engineer in the Amazon Transcribe service team. Outside of work, Bhaskar enjoys photography and singing.
Karan Grover is an engineer in the Amazon Transcribe service team. Outside of work, Karan enjoys hiking and is a photography enthusiast.
Paul Zhao is a Product Manager at AWS Machine Learning. He manages the Amazon Transcribe service. Outside of work, Paul is a motorcycle enthusiast and avid woodworker.