Transcribe speech to text in real time using Amazon Transcribe with WebSocket

Amazon Transcribe is an automatic speech recognition (ASR) service that makes it easy for developers to add speech-to-text capability to applications. In November 2018, we added streaming transcriptions over HTTP/2 to Amazon Transcribe. This enabled users to pass a live audio stream to our service and, in return, receive text transcripts in real time. We are excited to share that we recently started supporting real-time transcriptions over the WebSocket protocol. WebSocket support makes streaming speech-to-text through Amazon Transcribe more accessible to a wider user base, especially for those who want to build browser or mobile-based applications.

In this blog post, we assume that you are aware of our streaming transcription service running over HTTP/2, and focus on showing you how to use the real-time offering over WebSocket. However, for reference on using HTTP/2, you can read our previous blog post and tech documentation.

What is WebSocket?

WebSocket is a full-duplex communication protocol built over TCP. The protocol was standardized by the IETF as RFC 6455 in 2011. WebSocket is suitable for long-lived connectivity whereby both the server and the client can transmit data over the same connection at the same time. It is also practical for cross-domain usage. Voila! No need to worry about cross-origin resource sharing (CORS) as there would be when using HTTP.

Using Amazon Transcribe streaming with WebSocket

To use Amazon Transcribe’s StartStreamTranscriptionWebSocket API, you first need to authorize your IAM user to use the Amazon Transcribe Streaming WebSocket. Go to the AWS Management Console, navigate to Identity & Access Management (IAM), and attach the following inline policy to your user in the AWS IAM console. Please refer to “To embed an inline policy for a user or role” for instructions on how to add permissions.

{
    "Version": "2012-10-17",
    "Statement": [
        "Sid": "transcribestreaming",
        "Effect": "Allow",
        "Action": "transcribe:StartStreamTranscriptionWebSocket",
        "Resource": "*"
    ]
}

Your upgrade request should be pre-signed with your AWS credentials using the AWS Signature Version 4. The request should contain the required parameters, namely sample-rate, language code, and media-encoding. You could optionally supply vocabulary-name to use a custom vocabulary. The StartStreamTranscriptionWebSocket API supports all of the languages that Amazon Transcribe streaming supports today. After your connection is upgraded to WebSocket, you can send your audio chunks as an AudioEvent of the event-stream encoding in the binary WebSocket frame. The response you get is the transcript JSON, which would also be event-stream encoded. For more details, please refer to our tech docs.

To demonstrate how you can power your application with Amazon Transcribe in real time with WebSocket, we built a sample static website. On the website you can enter your account credentials, choose one of the preferred languages, and start streaming. The complete sample code is available on GitHub. JavaScript developers, among others, may find this to be a helpful start. We’d love to see what other cool applications you can build using Amazon Transcribe streaming with WebSocket!

About the authors

Bhaskar Bagchi is an engineer in the Amazon Transcribe service team. Outside of work, Bhaskar enjoys photography and singing.

Karan Grover is an engineer in the Amazon Transcribe service team. Outside of work, Karan enjoys hiking and is a photography enthusiast.

Paul Zhao is a Product Manager at AWS Machine Learning. He manages the Amazon Transcribe service. Outside of work, Paul is a motorcycle enthusiast and avid woodworker.

Blog

Learn About Our Meetup

5000+ Members

MEETUPS

JOB POSTINGS

CONTACT

Transcribe speech to text in real time using Amazon Transcribe with WebSocket

What is WebSocket?

Using Amazon Transcribe streaming with WebSocket

About the authors