How to Create a Speech-to-Text-to-Speech Program

It’s been exactly a decade since I started attending GeekCon (yes, a geeks’ conference)—a weekend-long hackathon-makeathon in which all projects must be useless and just for fun—and this year there was an exciting twist: all projects were required to incorporate some form of AI.

My group’s project was a speech-to-text-to-speech game, and here’s how it works: the user selects a character to talk to and then verbally expresses anything they’d like to the character. This spoken input is transcribed and sent to ChatGPT, which responds as if it were the character. The response is then read aloud using text-to-speech technology.

Now that the game is up and running, bringing laughs and fun, I’ve crafted this how-to guide to help you create a similar game on your own. Throughout the article, we’ll also explore the various considerations and decisions we made during the hackathon.

Want to see the full code? Here is the link!

The Program’s Flow

Once the server is running, the user will hear the app prompting them to select the character they want to talk to and start conversing. To speak out loud, they should press and hold a key on the keyboard while talking. When they finish talking (and release the key), their recording will be transcribed by Whisper (a speech-to-text model by OpenAI). The transcription will be sent to ChatGPT for a response, which will then be read out loud using a text-to-speech library for the user to hear.



Note: The project was developed on a Windows operating system and uses the pyttsx3 library, which lacks compatibility with M1/M2 chips. As pyttsx3 is not supported on Mac, users should consider exploring alternative text-to-speech libraries that are compatible with macOS environments.

OpenAI Integration

I used two OpenAI models: Whisper, for speech-to-text transcription, and the ChatGPT API, which generates responses based on the user’s input for their selected figure. The pricing model is cost-effective, with my current bill totaling less than $1 for all my usage. I initiated the service with a $5 deposit, which has not been depleted yet and will remain valid for another year.
I’m not receiving any payments or benefits from OpenAI for writing this.

Once you get your OpenAI API key, set it as an environment variable to use when making API calls. Be careful not to include the key in the codebase or any public location, and avoid sharing it unsafely.
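As a minimal sketch of that setup, you can read the key from the environment and fail fast if it’s missing (the helper name here is mine, not from the project):

```python
import os


def load_openai_key() -> str:
    # Read the key from the environment instead of hardcoding it in the codebase
    key = os.environ.get("OPENAI_API_KEY")
    if not key:
        raise RuntimeError("Please set the OPENAI_API_KEY environment variable")
    return key
```

Before making any API calls you would then assign the result, e.g. `openai.api_key = load_openai_key()`.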

Speech to Text: Create Transcription

The development of the speech-to-text feature was achieved using Whisper, an OpenAI model.

Below is the code snippet for the function that’s responsible for transcription:

import asyncio
import logging
import os
from threading import Thread
from typing import Optional

import openai


async def get_transcript(audio_file_path: str,
                         text_to_draw_while_waiting: str) -> Optional[str]:
    openai.api_key = os.environ.get("OPENAI_API_KEY")
    audio_file = open(audio_file_path, "rb")
    transcript = None

    async def transcribe_audio() -> None:
        nonlocal transcript
        try:
            response = openai.Audio.transcribe(
                model="whisper-1", file=audio_file, language="en")
            transcript = response.get("text")
        except Exception as e:
            logging.error(f"can't transcribe audio: {str(e)}")

    # Keep the user informed while we wait for the API response
    draw_thread = Thread(target=print_text_while_waiting_for_transcription,
                         args=(text_to_draw_while_waiting,))
    draw_thread.start()

    transcription_task = asyncio.create_task(transcribe_audio())
    await transcription_task

    if transcript is None:
        print("Transcription not available within the specified timeout.")

    return transcript

This function is marked as asynchronous (async) since the API call may take some time to return a response, and we await it to ensure that the program doesn’t progress until the response is received.
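The await-a-task pattern used above can be boiled down to a tiny self-contained example (the coroutine names here are illustrative stand-ins for the real API call):

```python
import asyncio


async def fake_transcribe() -> str:
    # Stand-in for the network round trip to the transcription API
    await asyncio.sleep(0.05)
    return "hello there"


async def run() -> str:
    task = asyncio.create_task(fake_transcribe())
    # Execution pauses here until the task completes
    return await task


print(asyncio.run(run()))  # hello there
```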

As you can see, the get_transcript function also invokes the print_text_while_waiting_for_transcription function. Why? Since obtaining the transcription is a time-consuming task, we wanted to keep the user informed that the program is actively processing their request and not stuck or unresponsive. As a result, this text is gradually printed as the user awaits the next step.
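The repo has the real implementation; a hypothetical sketch of such a character-by-character "typing" effect might look like this:

```python
import sys
import time


def print_text_while_waiting_for_transcription(
        text_to_draw: str, delay_seconds: float = 0.05) -> None:
    # Print the waiting message one character at a time, so the user
    # sees visible progress while the transcription request is in flight
    for char in text_to_draw:
        sys.stdout.write(char)
        sys.stdout.flush()
        time.sleep(delay_seconds)
    print()
```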

String Matching Using FuzzyWuzzy for Text Comparison

After transcribing the speech into text, we either used it as is or attempted to compare it with an existing string. The comparison use cases included selecting a figure from a predefined list of options, deciding whether to continue playing or not, and, when choosing to continue, deciding whether to choose a new figure or stick with the current one.

In these cases, we aimed to compare the user’s spoken input transcription with the options in our lists. We decided to use the FuzzyWuzzy string-matching library to select the closest option from a list, as long as the matching score was above a predefined threshold. Here’s a snippet of the function:

def detect_chosen_option_from_transcript(
        transcript: str, options: List[str]) -> str:
    best_match_score = 0
    best_match = ""

    for option in options:
        score = fuzz.token_set_ratio(transcript.lower(), option.lower())
        if score > best_match_score:
            best_match_score = score
            best_match = option

    if best_match_score >= 70:
        return best_match
    return ""

If you want to learn more about the FuzzyWuzzy library and its functions — you can check out an article I wrote about it here.
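If you’d rather avoid the extra dependency, a simplified token-overlap score can stand in for fuzz.token_set_ratio; note this is a rough approximation I wrote for illustration, not FuzzyWuzzy’s actual algorithm:

```python
from typing import List


def token_overlap_score(transcript: str, option: str) -> float:
    # Score (0-100) = fraction of the option's tokens present in the
    # transcript; loosely mimics fuzz.token_set_ratio for this use case
    transcript_tokens = set(transcript.lower().split())
    option_tokens = option.lower().split()
    hits = sum(token in transcript_tokens for token in option_tokens)
    return 100.0 * hits / len(option_tokens)


def detect_chosen_option(transcript: str, options: List[str],
                         threshold: float = 70.0) -> str:
    best = max(options, key=lambda opt: token_overlap_score(transcript, opt))
    return best if token_overlap_score(transcript, best) >= threshold else ""


print(detect_chosen_option("I want to talk to shrek please",
                           ["Shrek", "Oprah Winfrey"]))  # Shrek
```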

Get ChatGPT Response

Once we have the transcription, we can send it over to ChatGPT to get a response.

For each ChatGPT request, we added a prompt asking for a short and funny response. We also told ChatGPT which figure to pretend to be.

So our function looked as follows:

def get_gpt_response(transcript: str, chosen_figure: str) -> str:
    system_instructions = get_system_instructions(chosen_figure)
    try:
        return make_openai_request(
            system_instructions=system_instructions, transcript=transcript)
    except Exception as e:
        logging.error(f"could not get ChatGPT response. error: {str(e)}")
        raise e

and the system instructions looked as follows:

def get_system_instructions(figure: str) -> str:
    return f"You provide funny and short answers. You are: {figure}"
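The make_openai_request helper itself isn’t shown above; as a hedged sketch (the model choice and helper signature are my assumptions, not the project’s), it might wrap the chat-completion call like this:

```python
def build_messages(system_instructions: str, transcript: str) -> list:
    # Standard chat-completion format: a system prompt plus the user's turn
    return [
        {"role": "system", "content": system_instructions},
        {"role": "user", "content": transcript},
    ]


def make_openai_request(system_instructions: str, transcript: str) -> str:
    import openai  # deferred so the sketch parses without the package installed

    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=build_messages(system_instructions, transcript),
    )
    return response["choices"][0]["message"]["content"]
```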

Text to Speech

We chose a Python library called pyttsx3 for the text-to-speech component. This choice was not only straightforward to implement but also offered several additional advantages. It’s free of charge, provides two voice options — male and female — and allows you to select the speaking rate in words per minute (speech speed).

When a user starts the game, they pick a character from a predefined list of options. If we couldn’t find a match for what they said within our list, we’d randomly select a character from our “fallback figures” list. In both lists, each character was associated with a gender, so our text-to-speech function also received the voice ID corresponding to the selected gender.
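A minimal sketch of that character-to-gender wiring might look like this; the Gender enum name appears in the snippet below, but the figure lists here are illustrative, not the project’s actual data:

```python
import random
from enum import Enum


class Gender(Enum):
    MALE = "male"
    FEMALE = "female"


# Illustrative lists; the real project defines its own figures
FIGURES = {"Shrek": Gender.MALE.value, "Oprah Winfrey": Gender.FEMALE.value}
FALLBACK_FIGURES = {"Albert Einstein": Gender.MALE.value}


def pick_figure(detected: str) -> tuple:
    if detected in FIGURES:
        return detected, FIGURES[detected]
    # No match in the main list: fall back to a random predefined figure
    name = random.choice(list(FALLBACK_FIGURES))
    return name, FALLBACK_FIGURES[name]
```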

This is what our text-to-speech function looked like:

def text_to_speech(text: str, gender: str = Gender.FEMALE.value) -> None:
    engine = pyttsx3.init()

    engine.setProperty("rate", WORDS_PER_MINUTE_RATE)
    voices = engine.getProperty("voices")
    voice_id = voices[0].id if gender == "male" else voices[1].id
    engine.setProperty("voice", voice_id)

    engine.say(text)
    engine.runAndWait()


The Main Flow

Now that we’ve more or less got all the pieces of our app in place, it’s time to dive into the gameplay! The main flow is outlined below. You might notice some functions we haven’t delved into (e.g. choose_figure, play_round), but you can explore the full code by checking out the repo. Eventually, most of these higher-level functions tie into the internal functions we’ve covered above.

Here’s a snippet of the main game flow:

import asyncio

from src.handle_transcript import text_to_speech
from src.main_flow_helpers import choose_figure, start, play_round, \
    is_another_round


def farewell() -> None:
    farewell_message = "It was great having you here, " \
                       "hope to see you again soon!"
    text_to_speech(farewell_message)


async def get_round_settings(figure: str) -> dict:
    new_round_choice = await is_another_round()
    if new_round_choice == "new figure":
        return {"figure": "", "another_round": True}
    elif new_round_choice == "no":
        return {"figure": "", "another_round": False}
    elif new_round_choice == "yes":
        return {"figure": figure, "another_round": True}


async def main():
    another_round = True
    figure = ""

    while True:
        if not figure:
            figure = await choose_figure()

        while another_round:
            await play_round(chosen_figure=figure)
            user_choices = await get_round_settings(figure)
            figure, another_round = \
                user_choices.get("figure"), user_choices.get("another_round")
            if not figure:
                break

        if another_round is False:
            farewell()
            break


if __name__ == "__main__":
    asyncio.run(main())

The Roads Not Taken

We had several ideas for the hackathon that we didn’t have the opportunity to implement. This was either because we couldn’t find a suitable API during the weekend, or because time constraints prevented us from developing certain features. These are the paths we didn’t pursue for this project:

Matching the Response Voice with the Chosen Figure’s “Actual” Voice

Imagine if the user could choose to converse with Shrek, Trump, or Oprah Winfrey. We aimed for our text-to-speech library or API to deliver responses using voices that matched the selected personality. However, we were unable to find a library or API during the hackathon that provided this feature at an affordable price. We remain open to suggestions if you have any.

Letting Users Talk to “Themselves”

One intriguing idea was to have users provide a vocal sample of themselves speaking, which would be used to train a model. The user could then choose the tone of the responses (affirmative and supportive, sarcastic, angry, etc.) by reading them aloud in their voice after ChatGPT generated them. However, we were unable to find an API that supported this within the constraints of the hackathon.

Adding a Frontend to Our Application

Our original plan was to incorporate a front-end component into our application. However, due to a last-minute change in the number of participants in our group, we decided to prioritize backend development. As a result, the application currently operates on the command line interface (CLI) and lacks a frontend component.

Additional Improvements We Have In Mind

Latency is currently my primary concern. Multiple components in the flow have relatively high latency, which, in my opinion, slightly detracts from the user experience. For instance, the time it takes from completing the audio input to receiving a transcription, and the delay from when the user presses a button to when the system actually starts recording the audio. Consequently, if the user begins speaking immediately after pressing the key, there will be at least one second of audio that won’t be recorded due to this lag.
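A simple first step toward tackling this is to measure where the time actually goes. A generic timing wrapper around each stage (the function and label names here are placeholders, not the project’s code) could look like:

```python
import time
from typing import Any, Callable


def timed(label: str, func: Callable[..., Any], *args, **kwargs) -> Any:
    # Measure the wall-clock latency of a single pipeline stage
    start = time.perf_counter()
    result = func(*args, **kwargs)
    elapsed = time.perf_counter() - start
    print(f"{label} took {elapsed:.2f}s")
    return result
```

You could then wrap each stage, e.g. `transcript = timed("transcription", transcribe, audio_path)`, to see which step dominates the end-to-end delay.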

Link to the Repo & Credits

Want to see the whole project? It’s right here!

Also, warm credit goes to Lior Yardeni, my hackathon partner with whom I created this game.

Summing Up

In this article, we explored the process of creating a speech-to-text-to-speech game using Python and integrating it with AI. We’ve used the Whisper model by OpenAI for speech recognition, played around with the FuzzyWuzzy library for text matching, tapped into ChatGPT’s conversational magic via their developer API, and brought it all to life with pyttsx3 for text-to-speech. While OpenAI’s services (Whisper and ChatGPT for developers) do come with a cost, they’re budget-friendly.

We hope you’ve found this guide enlightening and that it motivates you to embark on projects of your own.
