Talks Tech #61: Speech to Text: Making Podcasts and AI Accessible

Written by Akanksha Malik


This article was adapted from the podcast recording of Talks Tech #61.

Watch or Listen on YouTube

Akanksha Malik is a data and AI consultant, a Microsoft AI MVP, and an international speaker. After studying financial math and actuarial science at UCC Ireland, she realized she wanted to work with people as well as numbers. As a consultant, she works with clients to help them solve problems by making more informed decisions with data. She’s a firm believer in diversity and inclusivity. She loves machine learning and finds it becoming more accessible to everyone. She’s an advocate for women in STEM and is currently the network director of Women Who Code Melbourne and an advisory board member for Tech Diversity Lab.

The following is an edited transcript of her 2023 WWCode CONNECT Asia presentation, which was showcased as WWCode Podcast Talks Tech Episode #61. Watch or listen to the video linked above to follow along with this written tutorial.

So, I’m going to get right into it. And let’s basically chat about speech-to-text. The easiest way for me to start any session so I can ease myself into it is by talking about myself. So Molly’s done a great introduction, but that is the most flamboyantly Irish photo of me that has ever existed and probably will ever exist. And it’s my usual go-to to be like, “Right.” So this is a weird mix of an accent because I grew up in Ireland, and then I moved to Melbourne, and now I don’t know what the hell this accent is. So it’s a mix of everything, really. But, yeah, I’m actually joining everyone today from Perth, which is a little bit away from my usual home base, but I’m really, really excited to be here with everyone today. So, yeah, I’ve been a consultant in data and AI for the last five-ish years, maybe a little more. And when I’d moved over to Australia, where I knew nobody, I was like, “All right. I’m going to visit the other side of the world. This will be a great idea.” And then I realized I had no friends. So, the first few friends I made were through the Women Who Code. They held events that I could join.

The person who actually started Women Who Code Melbourne back along in 2018 and kicked it back up is the person whose house I’m staying at in Perth, which is just the most wholesome, full circle moment for me. But, yeah, I’ve been really, really lucky to be involved in the community that I have been. So that’s me, I think. Other than that, I work consulting a small… I’m an independent consultant. I do things like this where I talk about the stuff I’ve discovered and worked on recently. And I have a podcast called Paths Uncovered, which is what we’re going to talk about today. So, we’re going to focus on building a speech-to-text transcription of a podcast. And this came around a little bit of a roundabout way, so let’s start diving in. So I’ve got a podcast called Paths Uncovered, which is now transitioning into a bigger community where people from untraditional pathways into technology can find their home, especially having worked with Women Who Code. So many women who are just transitioning into the whole world of technology come in. They can come from a million different backgrounds because those opportunities might not have existed back then in terms of getting involved with STEM originally.

So I’ve had unbelievable people come in and have chats with me about how they got to where they did, ballerinas who went to Juilliard and had ballet degrees and used to perform at the Lincoln Theatre and stuff. When COVID hit, they transitioned into working in AI because that was the biggest thing at the time. I’ve had English teachers tell me about their careers and teaching and how they pivoted away from the teaching part to actually getting involved with tech. And usually, when you start seeing these stories, you’re like, “This is amazing, and I want to hear more.” And it was the most inspirational thing. And, yeah, look, it was a lockdown time. And I was like, “What’s one thing that people aren’t doing? Making podcasts. I should do that.” As if there weren’t enough of them out there. But I’ve had a really, really great time being able to share these stories out into the world. But it was really fun, and I was like, “Right, okay, what other platform can I put it onto? Make it as accessible as possible for people to join in.” We even recorded videos. And I was like, “Right, making sure that people can watch it if they want.” At least I know… I have friends who refuse to just listen to things on podcasts. They can’t enjoy that format. So, there’s YouTube available for them to watch.

And then I started realizing all of these things still weren’t really hitting the mark with accessibility, because if you had any hearing disability, or anything at all that stopped you from listening properly, it basically wasn’t hitting the mark. So that’s where we started getting into the accessibility area, where I was like, “Okay, so we need some speech-to-text happening. And within that, I need to be able to make sure I can separate out the speakers because there are two people speaking, and the person reading the transcript needs to understand who is saying what and whose story is being talked about.” And the biggest thing, from my personal perspective, is that this is very much a part-time, on-the-side thing that I do because I enjoy doing it. It needed to be easy. It couldn’t take a lot of time, and it just needed to be a simple solution I could build. I was talking about this with my dad at home. I was like, “I need to build this thing. And I’ve been looking into solutions. They’re pretty expensive for what they are.” And Dad was like, “Aren’t you this person who does AI stuff all the time? You tell me about all these projects you do. Can’t you build this yourself? What kind of consultant are you?” And I was like, “That’s a very good point.”

So, I started figuring out how to build this for myself. Having looked around at a bunch of different tech stacks, this is the one I ended up using and what we’re going to go through in a couple of minutes. I used Azure Cognitive Services, which provided the ability to turn all the audio into text. Azure Machine Learning was the platform I used to do all the Python and code-y bits I needed. Then, Blob Storage was where I stored all the audio, so the services could read the audio from there and put the text back for me to access. So, that ended up being my tech stack. I’ve always enjoyed using Azure, and I use it quite regularly. So I was like, “This is again that ease of creation. Something I’m comfortable with, and it’s pretty quick and easy to get started with and not too expensive,” which was my other main concern.

So, what does Azure AI encompass? Let me give you a brief overview of everything in it. There are a few different aspects of AI within the Azure world. There’s the infrastructure. So that’s just its bare bones. If you want to do machine learning, they will provide you with the machines and give you whatever size you want. You pay for what you need from the data centers. The applied AI services are essentially the absolute other end from, “Let me just give you machines to do it.” This is where they’ve built the models, and they’ve basically integrated the use cases of those models, and they’re out-of-the-box solutions that are just ready to go. Cognitive services are one layer down from that, where they’ve basically given you the models, but they’re not out-of-the-box solutions. So, you need to connect these into the right places and make them fit and work for you. But you don’t need to create brand-new machine learning models. Azure Machine Learning is just a platform where you can connect all these different things in one place.

So that was an overview of where everything sits. So, let’s look at the applied AI services first. This is just a quick overview of the different ones that exist. And the out-of-the-box services are actually really, really cool if people want to try these out, or maybe I’ll do a session later in the year about how to use some of these, because you basically open up the service, you connect it up, you pop your data in, and it just works. It’s amazing.

So things like the Form Recognizer, where you put in the forms you want to extract information from: you can basically teach it on about 5 to 10 different documents, and it starts understanding what needs to be taken out. It’s so, so amazing the way they’ve actually built it. And I’m a very big fan of not recreating the wheel, so these things are ready to go out of the box. What we’re going to be looking at is the cognitive services. There are multiple different areas that they cover: vision for image-related stuff; language, if you want to do some translation pieces; speech, which is the one we’re going to be hitting up; and then decision pieces, so anomaly detectors and things like that. These are all pre-built models that you can connect to and then get to do what you want them to do.

So speech, that’s what we’re going to be looking at because we’re doing speech-to-text. And within that the different aspects of it are it can do speech-to-text, it can do the other way of text-to-speech, and it can do some speech translation. Also, the big thing for me was it can do speaker recognition. So it can identify, “Okay, there are two people speaking in this audio. I can identify speakers one and two and then differentiate them as it goes down.”

So I’m like, “This is perfect. This is hitting all of my needs.” Let’s actually dive in and see. I don’t know why my speech services came up differently there, but how does it all work? Where does it all live? What does it look like? Let me give you a very brief overview of what it actually looks like to set up a speech service. The easiest place to go through and do it all is the homepage for Microsoft Azure. To figure out where these things are: we’re looking at cognitive services, so hit up Azure AI services to find the speech service. If you want to search directly for a speech service, you can also do that. It will come up straight away. Once you go into it, you can create a new service. If you’ve never used Azure before, that’s completely fine. They do keep their UI experience very similar. Once you’ve used one or two services, it becomes a lot easier to keep using and creating them. I’d recommend using their documentation. It’s really well done in terms of following step-by-step instructions on how to set up different services or solutions. All you need is a subscription where they can essentially bill you for your usage. So they track your usage, and you pay according to what you actually use.

A resource group, essentially, if you think about it, is just a giant box. Whenever you create a new project, you can assign it to a box where everything sits, and then you can delete the box so that you’re not being charged afterward. I have been charged for many things I forgot to delete. So that’s the one thing I always talk about a lot: make sure you delete the things you don’t use or have finished using once you’ve tried them out. But you can create a new one directly within this. So, let’s call this Women Who Code Connect Asia. That’s it. That’s a new resource group that’s been created. What location do you want your service to be sitting in? You can literally pick anywhere across the world. Have fun picking whichever one. The usual recommendation is to pick somewhere closest to where you’re currently based, just so your data sits close to you and the services you’re connecting. If you’re using this at an enterprise level, one thing to remember is you might have governance rules around your data, so that it might not be allowed to leave a specific region. For example, I’ve worked with government clients here in Australia. Their data can’t leave Australian shores, so we have to pick an Australian data center. So yeah, it’s just something to keep in mind.

So, East US has already been picked. That’s all right. You give your service a name. I’m just going to call it Asia. Then it will give you a little drop-down of the pricing tiers. There are a few different ones: free, standard. If you’re trying this out for the first time, just use the free tier. And a little hack: you’re allowed to have one free tier in every region. So I’ve already used up the one in Australia East. So I can pick a random different one, say Canada Central, and then I have free access. The free tier gives you about 5 hours’ worth of audio transcription. So, if you’ve got a pretty small use case, feel free to use the free pricing tier. It covers almost everything except a few things, for which you might need to jump to the standard level. So, at a base level, it’s pretty accessible. So, returning to the ease of creation, there was a free tier to use this on. I was like, “Right, this is perfect. It’s going to be cheap for me to do, and I know how to use these services. So I can mess around with them and see where it goes.”

So if I pick the free pricing tier, I click next, next, next. There isn’t much I need to change in anything else. It just validates what I’ve set up, and it tells me it’s good to go, or it should in about two seconds. And then I can hit create. And that’s all you need to do to create a speech service. I won’t hit create right now because I can show you what a created one looks like. It only takes about a minute or two, but in the interest of saving time, this is what your service would look like if you create the free one. And what they give you are keys. So this is how you’ll connect to the actual models and the service when you’re using them.

So, what does that actually look like once you’ve created the service? Let’s jump back into the slide deck. So we’ve set up our environment where we’ve created the service, and now we actually need to be able to access it and talk to it. So this is going to be in Python. I basically just created a notebook, and I started writing some Python. There’s an SDK that was created specifically for the cognitive services in Python. There are a few other languages it also works with, but Python is the one I’m most comfortable with. So that’s where I dived in. So you pull in the library, which is the SDK, and the next thing you need to give in terms of the setup is the key, which is the key I just showed you when you set up your service, and then the region. So you tell it, “Hey, this key relates to wherever you’ve made your service.” It could be East US or the Canada one that I just showed you. Those are the two main things you actually need to create your connection.
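As a rough sketch of that setup step, assuming the `azure-cognitiveservices-speech` package is installed: the environment variable names `SPEECH_KEY` and `SPEECH_REGION` and both helper functions are hypothetical and not from the talk. Only `SpeechConfig(subscription=..., region=...)` is the SDK's own call.

```python
import os


def load_speech_settings(env=os.environ):
    """Read the key and region of the speech resource.

    SPEECH_KEY and SPEECH_REGION are hypothetical variable names; store the
    values shown in the Azure portal under whatever names suit you.
    """
    key = env.get("SPEECH_KEY")
    region = env.get("SPEECH_REGION")  # for example "eastus" or "canadacentral"
    if not key or not region:
        raise ValueError("Set both SPEECH_KEY and SPEECH_REGION first")
    return key, region


def make_speech_config(key, region):
    """Create the SDK connection object from the key and region."""
    # Imported inside the function so this file still loads without the SDK.
    import azure.cognitiveservices.speech as speechsdk

    return speechsdk.SpeechConfig(subscription=key, region=region)


# Typical use in a notebook cell, once the two variables are set:
#   speech_config = make_speech_config(*load_speech_settings())
```

Keeping the key in an environment variable rather than in the notebook itself is a design choice made here, so the key does not end up in shared or committed code.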

So now that it’s connected, we’ve set up our service and we’ve connected it up into the right place, into our Python accessibility notebook, let’s try doing speech-to-text on just a local test file. So I just recorded a 10-second audio of myself speaking into my laptop, stored that file, and then connected it up. And this is what the code looks like. It looks like a lot of code on the screen. I’m not a big fan of having a lot of code on screen, but this is just giving you a quick look into what it looks like. All of this stuff, especially all the Python code, I pulled in from the GitHub libraries that the Microsoft team have put together, especially for the speech services. So, none of this is groundbreaking code that I had to create myself. I’m just using what’s already been put in place by other people before me. Again, I’m not recreating the wheel. This needed to be easy to set up and not take a lot of time.

So we’re just basically doing recognition from a file, and you can see this is the speech key and then the region. Those have already been set up, so we’re just putting them back in. We tell it where the audio file is, the testing one that I’ve created, and then we tell it, “Hey, I would like you to do speech-to-text.” That’s essentially it. That’s all that really needs to go into this part. And then it gives you an output. So it basically says, “Running this test to make sure that my dot view big file can be… ” And I’m like, “That didn’t really make sense.” Which is where I realized that speech-to-text happens in utterances. And that sounds like a weird word, but it basically means that when you’ve said a few things, that’s categorized as one speech instance, and it stops there. So if you look at this, “recognize once” basically means recognize the first grouping of words. I would like it to recognize the whole file, so keep it going. So we add an extra little bit of code, which basically just says, “Keep going until there’s an actual end in the file.” So, you can see continuous recognition here.
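The difference between recognizing one utterance and recognizing the whole file can be sketched as follows. This is a minimal illustration in the spirit of Microsoft's published samples, not the code shown in the talk. The function names `transcribe_whole_file` and `make_file_recognizer` are hypothetical; the SDK calls used (`SpeechRecognizer`, `AudioConfig(filename=...)`, the `recognized`, `session_stopped`, and `canceled` events, `start_continuous_recognition`, and `stop_continuous_recognition`) are the SDK's own.

```python
import threading


def transcribe_whole_file(recognizer, timeout=None):
    """Collect every utterance until the recognition session ends.

    `recognizer` is expected to behave like the SDK's SpeechRecognizer.
    Calling recognizer.recognize_once() instead would return only the
    first utterance, which is why the first test output was cut short.
    """
    utterances = []
    done = threading.Event()

    def on_recognized(evt):
        # Fires once per utterance with the final text for that chunk.
        if evt.result.text:
            utterances.append(evt.result.text)

    def on_stopped(evt):
        # Fires when the file ends or recognition is canceled.
        done.set()

    recognizer.recognized.connect(on_recognized)
    recognizer.session_stopped.connect(on_stopped)
    recognizer.canceled.connect(on_stopped)

    recognizer.start_continuous_recognition()
    done.wait(timeout)
    recognizer.stop_continuous_recognition()
    return " ".join(utterances)


def make_file_recognizer(speech_config, wav_path):
    """Build a recognizer that reads from a local .wav file."""
    # Imported inside the function so this file still loads without the SDK.
    import azure.cognitiveservices.speech as speechsdk

    audio_config = speechsdk.audio.AudioConfig(filename=wav_path)
    return speechsdk.SpeechRecognizer(
        speech_config=speech_config, audio_config=audio_config
    )
```

Passing the recognizer in as an argument keeps the collection logic separate from the connection details, so it can be tried out with a stand-in object before spending any of the free-tier audio hours.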

So what that gives me then is the full result, which was, “Hey, I’m running this test to make sure that my… ” and this is meant to say dot WAV, W-A-V, which is a type of audio file, “… file can be interpreted properly and uploaded.” So, other than me just saying it weirdly, and this is going back to my weird Irish Australian mix of an accent, it did pretty well. Other than just that type of file utterance, it got everything right. This is amazing. This worked really well.

So, let’s try it out on some real-life stuff, which obviously meant a lot of podcast stuff for me. So I basically did the exact same thing. This is exactly the same code. I’m now just running it on the intro trailer. So I’m not running on a full 45-minute episode, I’m just trying to run it on a 2-minute audio file to see what the results look like. And it did a pretty good job. This is me introducing myself, and it got my name right, which was shocking. There are days when I can’t even spell my own name correctly, let alone a model that has never heard of me spelling it correctly. Paths, apparently I say paths, the word, really weirdly. So it was like, “Are you saying Pats, as in the name of a person or pods?” Here we go.

But it did a really, really good job. What I think was really interesting was the grammar part. There are commas, and there’s capitalization on the names and titles of companies… It was really well done, and I was so shocked that I had nothing other than just, “Here is this file. Go read it.” That’s all I said to it, and it worked. This is amazing. This is just a close-up of the result itself if you want to have a read through them. But it did a really great job without having me go in and customize the model in any way. So I was amazed by it.

So I was like, “Right, let’s try this now with the big full episode files.” And I’m like, “I’m going to upload those files into a specific blob storage account, where they’re just going to sit on the cloud and the services can talk to each other.” Let’s see how that worked. So when it’s in a blob, you need to tell it where the data is sitting. So there’s a few extra bits you need to do. So, just pointing to the right locations, and then this is the bit where it gets a little bit interesting. ‘Cause now I’m like, “It’s not just me talking in one audio. There’s going to be two speakers speaking ’cause it’s a full episode.”

There are a few bits and pieces of properties you can add in. So punctuation. Yep. I definitely want proper grammar and punctuation added to the translation, or rather the transcription. That would be great. And diarization enabled, that is the marker for, “Okay, there’s gonna be two speakers. I want you to identify who’s speaking when,” which is also where the word-level timestamps come in. So basically, every time someone speaks, it timestamps that, and that’s what it bases some of the speaker identification on. Then, I need to tell it where I want the transcription to be stored. It’s like, “Okay, the destination, go put it here, read it from here, and do the middle bits.” You can also add in profanity filter modes so you can actually mask any profanity. You can try playing around with a lot of stuff in these properties. But at a base level, all I had to turn on was setting diarization enabled and the word-level timestamps to true.
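To make those properties concrete, here is a hedged sketch of a batch transcription request body written as a plain Python dictionary. The field names follow the Speech to text REST API (v3.x) as best understood here and should be checked against the API version in use. The talk used Microsoft's Python sample client rather than raw JSON, so this is an equivalent illustration and not the code from the slides. The function name, display name, and both URLs are placeholders.

```python
import json


def build_transcription_request(audio_url, destination_url, locale="en-US"):
    """Assemble the request body for one batch transcription job."""
    return {
        "displayName": "Podcast episode transcription",  # placeholder name
        "locale": locale,
        # Where the service reads the audio from (a blob URL it can access).
        "contentUrls": [audio_url],
        "properties": {
            # Add commas, full stops and capitalization to the text.
            "punctuationMode": "DictatedAndAutomatic",
            # Label each phrase with a speaker number.
            "diarizationEnabled": True,
            # Timestamp every word.
            "wordLevelTimestampsEnabled": True,
            # Mask profanity in the output.
            "profanityFilterMode": "Masked",
            # Where the service writes the resulting JSON files.
            "destinationContainerUrl": destination_url,
        },
    }


# Placeholder URLs; real ones would point at your own storage account.
body = build_transcription_request(
    "https://example.blob.core.windows.net/audio/episode.wav",
    "https://example.blob.core.windows.net/transcripts",
)
print(json.dumps(body, indent=2))
```

In practice the two blob URLs would normally need to grant the service access, for example through a shared access signature; that detail is omitted here.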

So what does this output actually look like then? So now I’ve got a whole 45 minutes’ worth of audio going in, and it’s fired off to the model and stuff. Then, there is the other aspect of what one actually does with the output. It comes out in a JSON format, and it’s a really, really hard-to-read giant file that I had to start reading. And I’m like, “I don’t know how to read this. So I’m actually going to start getting Python to unravel it a little bit for me.” So again, this is just the code that I pulled together to play around with it and basically extract who is speaking. So, the ones and twos are just speaker one and speaker two, and what they’re saying, in a JSON dictionary, is what I’ve pulled out of the big giant file.
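A minimal sketch of that unravelling step, assuming the output follows the batch transcription result layout, where each entry in `recognizedPhrases` carries a `speaker` number and a list of `nBest` candidates with a `display` text. The sample JSON is made up for illustration and is far smaller than a real result; `extract_turns` is a hypothetical helper name.

```python
import json

# Made-up, heavily trimmed example of a transcription result file.
SAMPLE_RESULT = """
{
  "recognizedPhrases": [
    {"speaker": 2, "offsetInTicks": 50000000,
     "nBest": [{"display": "Thanks for having me."}]},
    {"speaker": 1, "offsetInTicks": 0,
     "nBest": [{"display": "Welcome to the show."}]}
  ]
}
"""


def extract_turns(result):
    """Return [{'speaker': int, 'text': str}, ...] in spoken order."""
    phrases = sorted(
        result.get("recognizedPhrases", []),
        key=lambda phrase: phrase.get("offsetInTicks", 0),
    )
    turns = []
    for phrase in phrases:
        candidates = phrase.get("nBest") or []
        if not candidates:
            continue  # nothing was recognized for this phrase
        # The first candidate is taken as the preferred transcription.
        turns.append(
            {"speaker": phrase.get("speaker"), "text": candidates[0]["display"]}
        )
    return turns


turns = extract_turns(json.loads(SAMPLE_RESULT))
for turn in turns:
    print(turn["speaker"], turn["text"])
```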

Once I did that, I said, “Okay, this is still not functional for my daily use. How do I actually make it work for me?” It needs to be a transcription that people can read, that makes sense, looks great, and looks pretty. This doesn’t look pretty to read to anybody. So I messed around with the code a bit more, and at least I got it put into a table. And I said, “Okay, this is a bit better.” But what you’ll see is it’s doing 1 1 1 1 1 1 2 2, and I want it to be, “Hey, the whole time speaker one is speaking, bundle that up into one block.” I don’t want to have to go through grouping manually. So I played around with the code a little bit more to do all the grouping stuff. I was like, “Okay, so speaker one is Akanksha, speaker two is gonna be Rachel” in this example, who came in and talked about her experience moving into tech.
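The grouping described above, turning a 1 1 1 2 2 sequence into one block per speaker turn, can be sketched with `itertools.groupby`, which bundles consecutive items that share a key. This is one possible approach and not necessarily how it was done in the talk. The sample phrases are invented, and `group_by_speaker` is a hypothetical helper name.

```python
from itertools import groupby

# Map the service's speaker numbers to names, as in the episode example.
SPEAKER_NAMES = {1: "Akanksha", 2: "Rachel"}


def group_by_speaker(turns, names):
    """Merge consecutive phrases from the same speaker into one block."""
    grouped = []
    for speaker, run in groupby(turns, key=lambda turn: turn["speaker"]):
        text = " ".join(turn["text"] for turn in run)
        # Fall back to a generic label for unmapped speaker numbers.
        grouped.append((names.get(speaker, f"Speaker {speaker}"), text))
    return grouped


# Invented phrases standing in for the per-phrase output of the service.
turns = [
    {"speaker": 1, "text": "Hi."},
    {"speaker": 1, "text": "Welcome back."},
    {"speaker": 2, "text": "Thanks for having me."},
    {"speaker": 1, "text": "Let's dive in."},
]

transcript = group_by_speaker(turns, SPEAKER_NAMES)
for name, text in transcript:
    print(f"{name}: {text}")
```

Because `groupby` only merges neighbouring items, a speaker who talks again later gets a new block, which is what a readable transcript needs.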

And then I got it to look something like this. I was like, “Okay, perfect. This is actually what I wanted.” Once I got it this far, I was like, “Okay, I actually can pull through and upload this into an accessible file that people can go and read.” So as a… I’m not going to jump into a recap yet. Let me show you what that file actually looks like. So this was the episode with Rachel. So, as you can see, you can find it on all the different platforms as well as YouTube, and then you can actually read the transcription below. It just basically goes through different speakers, and it has its own little bits. You can go through it if you want to read it. So I’m not saying it’s 100% accurate. I’m not too fond of it, I won’t lie about that, but it’s better than having nothing. And it was at that ease level it was doing such a good job straight away that I was like, “Oh, my God. I’ve made it; I’ve made something happen here.”

So now we can jump into a bit of a recap of what we all just went through. So I basically built a speech-to-text model that worked and transcribed the audio from the podcast into an accessible file that separated out the speakers, and I could just push that up onto the website for people to be able to access when they’re looking at an individual episode. There’s an ease of creation within this. All I had to pay for was storage, when I was storing those files on the service itself, so when I was reading them and then writing back. So, just that storage piece, which was a couple of cents; it was under $1 or $2 per month. And as well as that, the code I used was readily available on GitHub, and they’d built out sample documents for it. So I was basically able to manipulate that. It took about a day to build it into enough of a repeatable process where all I had to do was like, “Here’s the file. Off you go.” I’m working on making it much more automated over the next bit, so that it’ll automatically pick up whenever a new file is dropped in and just start the process. That’s something I’m working on down the line, and there are a few other pieces that I want to work on, such as translation.

So it’s great that, “Hey, oh, my God. Listen to it here, and then you can watch it here, and you can read it here,” but it’s still all in English. So if someone wanted to read it in a different language or listen to it in a different language, that is something that’s hopefully gonna come down the line and be an area that I can work on, especially with the availability of those services that exist within cognitive services. So thank you for listening. That QR code points directly to the website if you want to listen to some episodes or check those out. But yeah, I hope that gives you an overview of how you can use it. The whole point of this talk is that AI and the cloud are always talked about at that enterprise level. Usually, people say, “Oh, that’s a business thing.” But you can use it in your day-to-day life and make things a lot more accessible wherever you might actually be able to do that.

But why choose Azure among the many machine-learning speech platforms available?

The number one factor was that I have been using Azure, because I’ve worked on a bunch of consulting projects where… Especially in the Australian market, Azure and AWS are the main competitors that most industries use in terms of cloud services. So that was the base point of my experience with those two, and it was ready to go. The other side was that content had already been built. So again, this went back to that ease of creation. This was very much a Saturday-morning thing, and I did this in my own time. I don’t usually have that much time to dedicate to spending days and days behind the scenes trying to work it out. And then the cost. So, I had $100 worth of credits sitting on the Azure account. So I was like, “Okay, great,” but then the other side is I didn’t even really need those. It was $1 or $2 because the free tier catered to it all. And then there was code ready to go, and I was like, “Okay, this is just the easiest way of doing it.” So, that was essentially why I chose Azure. It was essentially the cost aspect, ease of creation, and getting it ready to go right away. So that is the main point behind why I did this.

Many easier options are available now, but this example is more just to show you the different ways you could bring these things into your day-to-day life, especially for speech-to-text transcriptions. If I upload this onto YouTube, it’s now started to do that as well. It can create a transcript for you if you ask for it. It’s gotten pretty good, too. I could copy-paste and do that, but when I was building this at the start of the year, it didn’t have speaker identification, the separation, built in. Whereas for this one, I was like, “Okay. I just have to set it to true or false. Flip that on, and it would do it for me.” So yeah, that was why I chose Azure, especially over Google, just because I haven’t used Google much at all. And mostly, that’s just because I haven’t had exposure to GCP in the market here, really.

From a practical application standpoint, you may wonder if your viewership, enjoyment, or engagement has increased since building out a more accessible solution.

I haven’t jumped into the “Who is looking at this and who is reading it.” I haven’t. When I look at the page views and the analytics, I see that people are spending a little bit more time on each page before they jump out. The other side is that it is a podcast, and most people, when they think of podcasts, are like, “Yeah, I’m going to go find the link to get to a certain page,” whether it’s Spotify or Apple Music or whatever it might be. So within that, yes and no. It makes me feel better that I’m like, “Yes, at least that exists, and it’s available.” I think the language piece, the translation piece, is going to be an interesting piece. If I can get that to work in terms of, “Okay, now I’ve got a fully functional transcription of the whole thing. Can I, A, translate that, just text to text, into a different language on demand for whoever wants to read it in whichever language? And once that’s done, can we actually get it to do the text-to-speech, where it reads that in that language and doesn’t sound too robotic?” And that will be an interesting piece of hitting a different market rather than just the reading part. But, yeah.

So I definitely think on both sides of things. In my personal opinion, I bet it has increased viewership just because it does have that extra element. But I think with what you’re talking about in terms of increasing language opportunities and being able to get it in a language that’s more accessible, maybe one that’s easier for you to understand or engage with, that makes a lot of sense and would definitely raise it to that next level. This isn’t technically tech-based like what you were talking about, but it relates to something I think is really cool about what you did in the podcast itself. So, you talk about your podcast being about these people finding these non-traditional paths into tech. From that perspective, for people on this call who are transitioning into tech, are there any common trends that you’ve seen from your podcast interviews, similar experiences that all of your guests have had, or at least pseudo-similar experiences?

You would laugh, but it’s so funny. Almost everyone has somehow moved into tech because there was a problem, and they just thought they’d start fixing it. They’re like, “I can do this.” This is such… And they’re pseudo-doing the tech already. They’re like, “Wait, if I can do this, I can do the rest too. It can be that… ” And it’s a running theme of there is a problem, and they’re like, “I can fix this.” And it doesn’t start like that for a lot of people. They’re working on something. They’re like, “If I just automate this, this would be easier.” And they’re like, “Wait, that’s not that bad. We can do this.” So that’s usually the trigger point… That’s been a lot of the stories of trigger points, of, “There was some problem that I was able to fix, which opened up a whole new world.”

I think the biggest thing, and this is going to sound very on-brand, especially where we are today, is communities. Find your community, find a community that supports you, and build a network, especially in markets such as Australia. I won’t speak much broader, but Australia has a pretty small tech market, and a lot comes down to knowing the people around you. Going to those meetings and meetups makes a huge difference. And that was the big thing. Every month, there are still new attendees who’ve never joined a meetup before. And they’re like, “This is our first one, and we’ve come to Women Who Code Melbourne.” I’m like, “Amazing. Come join us.”

But find that community that has got people who’ve been through it. So there are a few different boot camps that exist, I think, in Melbourne, and there’ll be people who are like, “Oh, I’m doing this boot camp.” And I can guarantee there’s always someone else at that event who’s done that boot camp. They’re like, “This is what I struggled with.” And the others are like, “Oh, thank God. It’s not just me.” And it’s usually that “it’s not just me” which makes everything a lot easier when you can find that community around you. So it doesn’t have to be Women Who Code; so many different communities will exist wherever you are. And if there isn’t a Women Who Code where you are, get that started. Come chat with us, and I’m sure someone will be happy to help you get that started where you are.