A step-by-step guide to building an AI assistant
It’s not as difficult as you might think to build an AI program that listens to speech and answers questions. By leveraging a few Python packages and APIs, we can make the magic happen in an afternoon.
Here are some back-and-forth questions and answers between me and the bot:
Me: What is the population of Brazil?
Bot: The population of Brazil is estimated to be over 209 million people
Me: What does ice cream taste like?
Bot: Most ice cream tastes sweet and creamy
Me: How do I use chopsticks?
Bot: To use chopsticks, you must hold one in each hand. Place the chopstick in your dominant hand between your thumb and index finger, and hold it in place with your middle finger. Place the chopstick in your non-dominant hand between your thumb and index finger, and hold it in place with your ring finger and pinky finger. To pick up food, use the chopstick in your dominant hand to hold the food in place, and then use the chopstick
Granted, these aren’t the most enlightening answers, and that chopstick one at the end is a bit strange (and gets cut off mid-sentence). However, the fact that this application can interpret speech and answer questions, no matter how seemingly limited, is amazing in my opinion. And unlike the mainstream AI assistant bots, we can see what’s under the hood here and play around with it.
What This Program Does
Run the file via the command prompt when you’re ready to ask a question
PyAudio captures speech data from the computer’s microphone
The audio is base64-encoded, wrapped in JSON, and streamed to the AssemblyAI API, which sends the transcribed text back
The text is sent to the OpenAI API for processing by the text-davinci-002 engine
The answer to the question is retrieved and shown on the console below your question
APIs and High-Level Design
This tutorial utilizes two core APIs:
AssemblyAI to transcribe the audio into text.
OpenAI to interpret the question and return an answer. It has also come to my attention that you can leverage OpenAI’s Whisper API to handle the transcription step as well.
Design (high level)
This project is broken up into two files: main and openai_helper.
The ‘main’ script handles the voice-to-text API connection. It involves opening a WebSocket connection to AssemblyAI’s server, filling in all the parameters required for PyAudio, and creating the asynchronous functions needed to send and receive the speech data concurrently between our application and AssemblyAI.
The `openai_helper` file is short and is used solely to connect to OpenAI’s “text-davinci-002” engine. This connection is used to receive answers to our questions.
First, we import all the libraries our application will use. Pip installation may be required for some of these, depending on what you’ve installed before. See the comments in the code for context.
Then we set up our PyAudio parameters. These inputs are default settings found in various places on the web. Feel free to experiment as needed, but the defaults worked fine for me. We set the stream variable as our initial container for the audio data, and then we print the default input device’s parameters as a dictionary. The keys of the dictionary mirror the fields of PortAudio’s device-info structure. Here’s the code:
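The original gist isn’t reproduced here, but a sketch of the typical parameter setup looks like the following. The specific values (3200-frame chunks of 16 kHz, 16-bit mono audio) are the commonly used defaults for AssemblyAI’s real-time endpoint, not necessarily the exact gist settings; the PyAudio calls themselves are shown as comments so the parameter math can be checked without a microphone.

```python
# Typical stream settings for real-time transcription (a sketch, not the
# article's exact gist). The pyaudio calls are commented out so this runs
# without audio hardware.
FRAMES_PER_BUFFER = 3200   # samples captured per chunk
CHANNELS = 1               # mono input
RATE = 16000               # 16 kHz sample rate
SAMPLE_WIDTH = 2           # bytes per 16-bit sample (pyaudio.paInt16)

# Each chunk therefore covers 3200 / 16000 = 0.2 seconds of audio...
CHUNK_SECONDS = FRAMES_PER_BUFFER / RATE
# ...and weighs 3200 samples * 2 bytes = 6400 bytes before base64 encoding.
CHUNK_BYTES = FRAMES_PER_BUFFER * SAMPLE_WIDTH * CHANNELS

# With PyAudio installed and a working mic, the stream would be opened like:
#   import pyaudio
#   p = pyaudio.PyAudio()
#   stream = p.open(format=pyaudio.paInt16, channels=CHANNELS, rate=RATE,
#                   input=True, frames_per_buffer=FRAMES_PER_BUFFER)
#   print(p.get_default_input_device_info())

print(f"{CHUNK_SECONDS:.1f}s per chunk, {CHUNK_BYTES} bytes")
```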
Next, we create the asynchronous functions that handle the sending and receiving required to transform our verbal questions into text. These functions run concurrently, which enables the speech data to be base64-encoded, wrapped in JSON, sent to the server via the API, and received back in a readable format. The WebSocket connection is also a vital piece of the script below, as that’s what makes the direct stream as seamless as it is.
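The real script sends these messages over a `websockets` connection to AssemblyAI; here is a self-contained sketch of just the concurrency pattern, using only the standard library, with an in-memory queue standing in for the socket. The `audio_data` field follows the shape AssemblyAI’s real-time API expects for audio messages; everything else (function names, the sample chunks) is illustrative.

```python
import asyncio
import base64
import json

def encode_chunk(raw_audio: bytes) -> str:
    """Base64-encode a raw audio chunk and wrap it in a JSON message."""
    return json.dumps({"audio_data": base64.b64encode(raw_audio).decode("utf-8")})

async def send_audio(queue: asyncio.Queue, chunks) -> None:
    # In the real script this loops on stream.read(FRAMES_PER_BUFFER) and
    # awaits websocket.send(); the queue stands in for the socket here.
    for chunk in chunks:
        await queue.put(encode_chunk(chunk))
    await queue.put(None)  # signal end of audio

async def receive_messages(queue: asyncio.Queue, results: list) -> None:
    # The real script awaits websocket.recv() and reads the transcript's
    # "text" field; here we just collect the payloads that were sent.
    while (message := await queue.get()) is not None:
        results.append(json.loads(message)["audio_data"])

async def main() -> list:
    queue: asyncio.Queue = asyncio.Queue()
    results: list = []
    # gather() runs both coroutines concurrently, as in the real script.
    await asyncio.gather(
        send_audio(queue, [b"first chunk", b"second chunk"]),
        receive_messages(queue, results),
    )
    return results

received = asyncio.run(main())
```

Swapping the queue for an actual `websockets.connect(...)` session (plus AssemblyAI’s auth header) turns this pattern into the real streaming loop.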
Lastly, we have our simple API connection to OpenAI. If you look at line 44 of the gist above (main3.py), you can see we are pulling the ask_computer function from this other file and using its output as the answers to our questions.
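Since the helper gist isn’t shown here either, this is a hedged sketch of what such a file boils down to: one function that builds a text-davinci-002 completion request. The `max_tokens` and `temperature` values are illustrative assumptions, not the gist’s exact settings, and the actual `openai.Completion.create` call is left as a comment so the sketch runs without an API key.

```python
def build_request(prompt: str) -> dict:
    """Assemble the parameters for a text-davinci-002 completion request.
    max_tokens/temperature are illustrative values, not the gist's."""
    return {
        "engine": "text-davinci-002",
        "prompt": prompt.strip(),
        "max_tokens": 100,
        "temperature": 0.5,
    }

def ask_computer(prompt: str) -> str:
    params = build_request(prompt)
    # With the openai package installed and a key configured, the real
    # helper would do roughly:
    #   import openai
    #   openai.api_key = "YOUR_API_KEY"
    #   response = openai.Completion.create(**params)
    #   return response["choices"][0]["text"]
    # Placeholder return so this sketch runs offline:
    return f"(would ask {params['engine']}: {params['prompt']})"

print(ask_computer("What is the population of Brazil?"))
```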
This was a neat project for anyone interested in playing around with the same technology that powers Siri or Alexa. Not much coding experience is required, because we leverage APIs to do the heavy processing. If you want to learn more about these technologies, I’d highly recommend forking the project’s repo and playing around first-hand. Cheers!
Update: If you found this one interesting, please check out another similar project of mine on using speech-to-text transcription to create dalle-mini images.