Research from the Harvard Business Review found that companies which strategically apply AI to customer service will create new solutions and pursue continual improvement, while ensuring systems align with customers. AI enables companies to customize customer interactions and deliver fast, responsive services that cater to individual needs.
The Software Mind team designed and developed a solution that incorporates AI-backed speech-to-text (STT) technology, which could greatly enhance how call centers operate, how banks transcribe calls with their clients, and how companies archive data.
This article will focus on the technical aspect of this speech-to-text solution and how it can seamlessly work with AI platforms made available by Google, Microsoft and Amazon.
An innovative call recording solution
Software Mind deployed a commercially viable solution, Recorder, that leverages OpenSIPS and RTPengine modules. The combination of OpenSIPS (https://www.opensips.org/About/About), a multi-functional SIP server, with RTPengine (https://github.com/sipwise/rtpengine) by Sipwise, an efficient RTP proxy, forms a strong telco layer foundation for voice application servers. This pairing can serve various roles in a telecom operator’s network. Moreover, adding Java-based steering applications to control OpenSIPS (which, in turn, manages RTPengine) can provide a comprehensive application server tailored to an operator’s needs, ensure optimal time-to-market (TTM) and deliver cost efficiency.
In this solution, OpenSIPS manages SIP signaling so that RTP packets at the media layer are proxied through RTPengine. RTPengine, in turn, loops these packets back to itself and stores them in files. Subsequently, custom Java applications process the recordings and present them to end users through a graphical interface.
The Recorder solution is integrated into an IP Multimedia Subsystem (IMS) architecture and is currently managing thousands of simultaneous sessions, enabling call recording for different types of users: B2C VoLTE (MMTel) and B2B hosted on Cisco BroadWorks Application Server, as well as Webex for BroadWorks. The Webex for BroadWorks service facilitates OTT (over-the-top) calls, which bypass an operator’s infrastructure and cannot be recorded by an operator. The OpenSIPS + RTPengine-based Recorder architecture can record the fraction of Webex calls that go through the operator network, allowing operators to mimic the recording features natively available in the Webex application for OTT calls.
Using OpenSIPS + RTPengine in the Recorder solution provides operators with a significant advantage by enabling call recording. At the same time, it opens up a wide range of post-processing capabilities that are now available using AI, thereby enhancing an operator’s business potential even further. Let’s focus on the potential offered by pairing the Recorder solution with AI technology.
Introducing an AI transcription service to OpenSIPS and RTPengine
Voice recordings stored using the OpenSIPS and RTPengine architecture can be leveraged by an AI model in a speech-to-text (STT) service. STT is an audio transcription service that converts received audio files into text files that contain the entire conversation. The resulting transcription makes it easier to search, analyze, and extract insights from voice recordings.
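To illustrate how stored recordings could be handed over to a transcription step, here is a minimal Python sketch. It assumes a hypothetical layout that is not part of the Recorder solution described here: recordings are audio files in a single directory, and each finished transcript is saved next to its recording with a .txt extension. The function name and file suffixes are illustrative only.

```python
from pathlib import Path

# Assumption for this sketch: recordings are stored as .wav or .mp3 files and
# a finished transcript is saved next to its recording as <name>.txt.
AUDIO_SUFFIXES = {".wav", ".mp3"}


def pending_recordings(recordings_dir):
    """Return recordings without a transcript yet, sorted by file name."""
    root = Path(recordings_dir)
    pending = []
    for path in sorted(root.iterdir()):
        is_audio = path.is_file() and path.suffix.lower() in AUDIO_SUFFIXES
        if is_audio and not path.with_suffix(".txt").exists():
            pending.append(path)
    return pending
```

A periodic job could call pending_recordings() on the recordings directory and pass each returned path to whichever STT service the operator selects.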
Let’s take a look at what the combined architecture can look like:
Depending on an operator’s capabilities and preferences, the AI instance shown in the picture can be either a local or a cloud instance.
Many AI providers offer an STT transcription service, from the most prominent players like Google Cloud AI, Amazon Web Services, IBM and Microsoft Azure Cognitive Services to smaller but still well-known ones like Rev.ai, Deepgram or OpenAI. All of them deliver speech recognition technology with broad language support, high accuracy and performance.
OpenAI Whisper in a speech-to-text solution
For a proof of concept (PoC) built for demo purposes, our team decided to use Whisper, the free, open-source model from OpenAI, to demonstrate the benefits of combining recording with a speech-to-text feature on the OpenSIPS + RTPengine architecture.
Whisper is an automatic speech recognition (ASR) model trained on a large and diverse audio dataset. The main advantage of choosing Whisper for this PoC is that it can be installed locally on the same machine where the recorded files are stored, without opening additional network rules or running a separate application to serve an AI API.
Benefits and challenges when using Whisper
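Whisper was developed and tested with Python 3.9.9 and PyTorch 1.10.1, and its codebase is expected to be compatible with other recent Python and PyTorch versions. Additionally, you must install the FFmpeg command-line tool for proper audio processing.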
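As a rough illustration of a local setup, the sketch below checks that the ffmpeg executable is on the PATH and then transcribes one file with the openai-whisper Python package. The model size ("base"), the function names and the timestamp formatting are assumptions made for this example, not details of our PoC.

```python
import shutil


def ffmpeg_available():
    """Whisper decodes audio by calling the ffmpeg executable, so check PATH."""
    return shutil.which("ffmpeg") is not None


def format_timestamp(seconds):
    """Format seconds as MM:SS.mmm, the style used for short recordings."""
    total_ms = round(seconds * 1000)
    minutes, rest_ms = divmod(total_ms, 60_000)
    secs, ms = divmod(rest_ms, 1000)
    return f"{minutes:02d}:{secs:02d}.{ms:03d}"


def transcribe_file(audio_path, model_name="base"):
    """Transcribe one recording locally; returns (language_code, lines)."""
    # Imported lazily so the helpers above work without the package installed
    # (pip install openai-whisper).
    import whisper

    if not ffmpeg_available():
        raise RuntimeError("ffmpeg was not found on PATH")
    model = whisper.load_model(model_name)
    result = model.transcribe(str(audio_path))
    lines = [
        f"[{format_timestamp(seg['start'])} --> {format_timestamp(seg['end'])}] "
        f"{seg['text'].strip()}"
        for seg in result["segments"]
    ]
    return result["language"], lines
```

Calling transcribe_file() on a recording returns the detected language code and one line per segment. Larger model sizes generally trade speed for accuracy.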
The accuracy of transcription varies depending on the language of the recording. English offers the best results. The model was able to recognize the language automatically, and the transcription had a low error rate (this is our general observation based on the tests we performed; no statistical method was used).
Importantly, the model performed the transcription without a language flag specified and was able to recognize English. Furthermore, English was flawlessly identified even when woven into a Polish dialogue, and even though the entire file was recognized as Polish.
As for other languages, our team noticed that native speakers’ accents were recognized correctly, but the language of non-native speakers was not detected accurately, and their transcriptions contained errors.
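The automatic language detection described above can also be called on its own. The following sketch uses the openai-whisper package's detection call on the first 30 seconds of a file, which is the window the library uses for detection. That may explain why a mostly Polish file with English passages is labeled as Polish as a whole. The helper names and the model size are assumptions for illustration.

```python
def top_languages(probs, n=3):
    """Return the n most probable (language_code, probability) pairs."""
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    return ranked[:n]


def detect_language(audio_path, model_name="base"):
    """Detect the spoken language from the first 30 seconds of a recording."""
    import whisper  # lazy import; requires: pip install openai-whisper

    model = whisper.load_model(model_name)
    # pad_or_trim cuts (or pads) the audio to Whisper's 30-second window.
    audio = whisper.pad_or_trim(whisper.load_audio(str(audio_path)))
    mel = whisper.log_mel_spectrogram(audio, n_mels=model.dims.n_mels)
    _, probs = model.detect_language(mel.to(model.device))
    return top_languages(probs)
```

Looking at the second and third candidates, not only the winner, can help spot recordings with mixed languages or non-native speakers.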
Results in various languages:
Polish:
Detected language: Polish
[00:00.000 --> 00:11.000] Dzień dobry, chciałem zapytać o najnowszą ofertę
[00:11.000 --> 00:14.000] Dzień dobry, proszę o podanie identyfikatora klienta.
Spanish:
Detected language: Spanish
[00:00.000 --> 00:07.640] Me llamo Jorge, soy ingeniero, me quiero ir a casa y tengo un coche rojo.
Czech:
Detected language: Czech
[00:00.000 --> 00:06.160] Ahoj, menují se Tomáš, jsem injernír a právě řidím červené vozilo na cestu domů.
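Timestamped lines like the ones above are easy to turn into searchable records. The sketch below parses the MM:SS.mmm format shown in these results and filters segments by keyword. The function names are hypothetical, and longer recordings whose timestamps include an hours field would need an extended pattern.

```python
import re

# Matches "[MM:SS.mmm --> MM:SS.mmm] text"; also tolerates arrows that were
# altered by typographic conversion ("–>") or shortened ("->").
LINE_RE = re.compile(
    r"\[(\d+):(\d{2})\.(\d{3})\s*(?:-->|–>|->)\s*(\d+):(\d{2})\.(\d{3})\]\s*(.*)"
)


def parse_line(line):
    """Parse one transcript line into a dict, or return None if it is not one."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    m1, s1, ms1, m2, s2, ms2, text = match.groups()
    return {
        "start_ms": (int(m1) * 60 + int(s1)) * 1000 + int(ms1),
        "end_ms": (int(m2) * 60 + int(s2)) * 1000 + int(ms2),
        "text": text,
    }


def search(lines, keyword):
    """Return parsed segments whose text contains the keyword (case-insensitive)."""
    needle = keyword.lower()
    segments = (parse_line(line) for line in lines)
    return [seg for seg in segments if seg and needle in seg["text"].lower()]
```

Searching the Polish sample for "identyfikatora", for example, would return the second segment together with its start and end offsets, which an interface could use to jump to that point in the recording.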
Post-processing AI features useful for businesses
The speech-to-text results shown above are just one example of the output AI can generate from recording files. A broader range of AI post-processing outputs is available, each of which can add significant value to operators’ businesses.
Here are some AI features available for voice files that may bring value for businesses:
Sentiment Analysis
Emotional tone detection: Analyzes the sentiment and emotional tone of a conversation, identifying whether a speaker is happy, angry, or frustrated. It can be particularly useful for customer service applications to gauge customer satisfaction and agent performance.
Call intent discovery
Intent analysis: Identifies the purpose of a call by analyzing the context and content of the conversation. This helps in understanding customer needs, improving service delivery, and tailoring responses. It can also categorize calls into various intents like support, sales, complaints, and inquiries.
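To make the idea of categorizing calls more concrete, here is a deliberately simple Python sketch that assigns a transcript to one of the intents listed above by counting keyword matches. It is a toy stand-in, not the method an AI service would use: a production system would rely on a trained classifier or a language model, and the keyword lists here are invented for illustration.

```python
# Hypothetical keyword lists; real systems would learn these from data.
INTENT_KEYWORDS = {
    "support": ("not working", "error", "reset", "help with"),
    "sales": ("offer", "price", "upgrade", "buy"),
    "complaints": ("complaint", "unacceptable", "refund"),
    "inquiries": ("opening hours", "how do i", "where can i"),
}


def classify_intent(transcript):
    """Return the intent with the most keyword hits, or "unknown" if none match."""
    text = transcript.lower()
    scores = {
        intent: sum(text.count(keyword) for keyword in keywords)
        for intent, keywords in INTENT_KEYWORDS.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "unknown"
```

Even this naive approach shows the data flow: the transcript produced by the STT step becomes the input, and the resulting label can be stored alongside the recording for reporting or routing.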
The future of AI post-processing
By transforming raw voice recordings into valuable assets, this AI-powered post-processing add-on offers operators a significant competitive edge. Along with rich insights and advanced analytical capabilities, it enhances the OpenSIPS + RTPengine architecture and adds value to an operator’s services by enabling their customers to make informed business decisions based on information acquired more quickly and efficiently using AI.
AI post-processing of recordings creates excellent opportunities, but with OpenSIPS + RTPengine enabled in each call flow, the natural next step is online (real-time) processing. For this, the Whisper instance can be switched to faster-whisper (https://github.com/SYSTRAN/faster-whisper), a reimplementation of OpenAI’s Whisper model using CTranslate2 that is up to four times faster than OpenAI’s Whisper on the same computing resources.
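Switching to faster-whisper requires only a small change in code. The sketch below shows what a call could look like with the faster-whisper Python package. The model size, the CPU device and the int8 compute type are assumptions chosen for a modest machine, and the function names are illustrative.

```python
def format_segment(segment):
    """Format an object with .start, .end (seconds) and .text as one line."""
    return f"[{segment.start:.2f}s -> {segment.end:.2f}s] {segment.text.strip()}"


def transcribe_fast(audio_path, model_size="small"):
    """Transcribe with faster-whisper; returns (language, probability, lines)."""
    # Lazy import; requires: pip install faster-whisper
    from faster_whisper import WhisperModel

    # Assumption: CPU inference with 8-bit quantization to limit memory use.
    model = WhisperModel(model_size, device="cpu", compute_type="int8")
    segments, info = model.transcribe(str(audio_path), beam_size=5)
    # "segments" is a generator, so decoding happens while we iterate.
    lines = [format_segment(segment) for segment in segments]
    return info.language, info.language_probability, lines
```

Because segments are produced lazily, a caller can start processing the first lines of a long recording before the whole file has been decoded.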
Additionally, incorporating whisper_streaming (https://github.com/ufal/whisper_streaming), with sampling time optimized for real-time streams, can further enhance the system and open up even more opportunities for an operator’s customers. This approach may be much more demanding in terms of computing resources (CPU, GPU, RAM). Nevertheless, this path seems promising as AI adoption in telco services continues to grow.
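Real-time transcription generally means feeding the model audio in short, regular chunks instead of whole files. The sketch below shows only that chunking step in plain Python and does not use the whisper_streaming API. It assumes the call audio has already been decoded to 16 kHz, 16-bit mono PCM, and the chunk length is an arbitrary example value.

```python
SAMPLE_RATE = 16_000  # Whisper models work on 16 kHz audio
BYTES_PER_SAMPLE = 2  # assumption: 16-bit mono PCM


def iter_chunks(pcm_blocks, chunk_seconds=1.0):
    """Regroup arbitrarily sized PCM byte blocks into fixed-duration chunks."""
    chunk_bytes = int(SAMPLE_RATE * chunk_seconds) * BYTES_PER_SAMPLE
    buffer = bytearray()
    for block in pcm_blocks:
        buffer.extend(block)
        while len(buffer) >= chunk_bytes:
            yield bytes(buffer[:chunk_bytes])
            del buffer[:chunk_bytes]
    if buffer:
        # Trailing partial chunk once the stream (the call) has ended.
        yield bytes(buffer)
```

Shorter chunks reduce latency but give the model less context per step, which is one reason streaming setups tend to need more computing resources than batch post-processing.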
If you are interested in exploring the possibilities offered by speech-to-text and AI, use this contact form to get in touch with one of our experts.
About the author
Anna Metkowska
Senior Telco DevOps Engineer
A senior telco DevOps engineer with 15+ years of experience in the telco industry, Anna works daily with SIP-based platforms like IMS and NGN. She designs and integrates these platforms with Cisco BroadWorks, custom-developed application servers, and hosting services tailored to customer needs. Anna is focused on delivering solutions that are comprehensive, functional, and fulfill customer needs.