Building a Voice-Controlled Toy Car with TinyML

Introduction

TinyML makes it possible to embed artificial intelligence directly into microcontrollers, allowing devices to understand and respond to the world without an internet connection. One of the most exciting applications is voice control. By training a small machine learning model to recognize specific commands, you can make a toy car respond to spoken instructions like “go,” “stop,” or “reverse” — all running locally on a low-power microcontroller.

This project blends hardware, embedded programming, and AI, and offers a great opportunity to explore how keyword spotting works in constrained environments.

Concept Overview

The idea is simple: a microphone captures audio, the microcontroller processes it to extract features, and a TinyML model classifies the speech into one of several predefined commands. The microcontroller then drives the motors according to the recognized instruction.
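
Sketched as Arduino-style C++ for the ESP32, the whole pipeline amounts to a loop like the one below. Every helper function here is only a placeholder; the real versions are developed in the sections that follow.

    #include <Arduino.h>

    // Conceptual pipeline only: each helper is a stand-in for code described later.
    void captureAudio(int16_t *buf, size_t n)                { /* read from the I2S mic */ }
    void computeMFCC(const int16_t *buf, size_t n, float *f) { /* feature extraction */ }
    int  classifyCommand(const float *features)              { return -1; /* TinyML inference */ }
    void driveMotors(int command)                            { /* set the motor driver pins */ }

    void setup() { Serial.begin(115200); }

    void loop() {
      int16_t samples[800];   // roughly 50 ms of audio at 16 kHz
      float   features[490];  // placeholder feature buffer
      captureAudio(samples, 800);
      computeMFCC(samples, 800, features);
      int command = classifyCommand(features);
      if (command >= 0) driveMotors(command);  // -1 means "nothing recognized"
    }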

Unlike using a pre-built voice recognition module, this approach allows you to choose the words, adapt to different languages, and fine-tune the system for your environment.

Hardware Requirements

A capable microcontroller is essential. The ESP32 is a strong choice, offering a dual-core processor, generous RAM for audio buffering, and I²S support for digital microphones. An INMP441 MEMS microphone provides clean audio input. A TB6612FNG motor driver efficiently controls the DC motors of a small 2WD or 4WD chassis. Power comes from a rechargeable Li-ion battery pack, with a voltage regulator if needed.
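
One way to keep the wiring manageable is to record it as named constants in the firmware. The GPIO numbers below are only an example assignment, not a requirement of the parts; any free pins with the right capabilities will do.

    // Example ESP32 pin assignments for the INMP441 microphone (I2S) and the
    // TB6612FNG motor driver. Adjust these to match your own wiring.
    const int kI2sBclkPin = 26;   // INMP441 SCK  (bit clock)
    const int kI2sWsPin   = 25;   // INMP441 WS   (word select / LR clock)
    const int kI2sDataPin = 33;   // INMP441 SD   (serial data out)

    const int kMotorAIn1  = 16;   // TB6612FNG AIN1 (left motor direction)
    const int kMotorAIn2  = 17;   // TB6612FNG AIN2
    const int kMotorAPwm  = 4;    // TB6612FNG PWMA (left motor speed)
    const int kMotorBIn1  = 18;   // TB6612FNG BIN1 (right motor direction)
    const int kMotorBIn2  = 19;   // TB6612FNG BIN2
    const int kMotorBPwm  = 23;   // TB6612FNG PWMB (right motor speed)
    const int kMotorStby  = 5;    // TB6612FNG STBY (pull high to enable the driver)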

A stable, well-built chassis makes integration easier, so starting with a robot car kit is often a good idea.

Data Collection

Voice recognition begins with data. You will need many recordings of each command, such as “go,” “stop,” “left,” “right,” and “reverse”; a few dozen samples per keyword is a reasonable starting point. Including background noise and unrelated speech samples improves robustness. Recordings can be made on a computer with a decent microphone and then downsampled to 16 kHz mono WAV files.

For better accuracy, capture voices from different speakers and in different environments. The more varied your dataset, the better your model will perform in real-world conditions.

Model Training

Training takes place in Python with TensorFlow, targeting TensorFlow Lite for Microcontrollers for deployment. The raw audio is split into short frames, typically 30–40 milliseconds long, and transformed into Mel-frequency cepstral coefficients (MFCCs), a compact representation of the speech spectrum over time.

A small convolutional neural network (CNN) or depthwise-separable CNN (DS-CNN) is well-suited for this task. After training, the model should be quantized to INT8 to reduce memory usage and improve inference speed. The result is a .tflite file small enough to run on an ESP32 without exceeding RAM limits.

Model Deployment

Once the model is trained and quantized, it must be converted into a C array so it can be embedded directly in the firmware. The Arduino IDE or PlatformIO can be used to integrate the TensorFlow Lite for Microcontrollers library along with the model file.
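
A common way to do the conversion is running xxd -i on the .tflite file, or a small script that writes the same thing. The resulting header looks roughly like the sketch below; the symbol names and byte values here are purely illustrative.

    // model_data.h -- generated from the quantized .tflite model.
    // Only the first few bytes are shown; the real array is many kilobytes long.
    // alignas keeps the data correctly aligned for the TFLite Micro interpreter.
    alignas(8) const unsigned char g_model_data[] = {
        0x1c, 0x00, 0x00, 0x00, 0x54, 0x46, 0x4c, 0x33, /* ...the rest of the model bytes... */
    };
    const unsigned int g_model_data_len = sizeof(g_model_data);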

The firmware continuously samples audio from the I²S microphone, performs feature extraction, and feeds the MFCCs to the model. The predicted label is then mapped to motor control actions, so that “go” moves the car forward, “left” turns it, and “stop” halts all motion.
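
A sketch of that mapping step is shown below, assuming a TensorFlow Lite Micro interpreter has already been set up, that the output tensor holds one INT8 score per keyword, and that the label order matches the one used in training; the motor helpers and the confidence threshold are illustrative placeholders, and the header path may differ slightly between TFLM distributions.

    #include "tensorflow/lite/c/common.h"  // TfLiteTensor

    // Motor helpers: placeholders here; in the real firmware they set the
    // TB6612FNG direction pins and PWM duty.
    void driveForward()  { /* AIN1/BIN1 forward, PWM on */ }
    void driveBackward() { /* both channels reversed */ }
    void turnLeft()      { /* run the right motor only */ }
    void turnRight()     { /* run the left motor only */ }
    void stopMotors()    { /* PWM duty to zero */ }

    // Called after interpreter->Invoke(); assumed label order:
    // 0 "go", 1 "stop", 2 "left", 3 "right", 4 "reverse".
    void actOnOutput(const TfLiteTensor *output) {
      int best = 0;
      for (int i = 1; i < 5; ++i) {
        if (output->data.int8[i] > output->data.int8[best]) best = i;
      }
      if (output->data.int8[best] < 60) return;  // ignore low-confidence results

      switch (best) {
        case 0:  driveForward();  break;
        case 2:  turnLeft();      break;
        case 3:  turnRight();     break;
        case 4:  driveBackward(); break;
        default: stopMotors();    break;  // "stop" or anything unexpected
      }
    }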

Integration and Testing

It’s important to verify each component before combining them. Start by testing motor control manually to ensure the driver and motors are wired correctly. Next, verify the microphone input by printing the audio waveform or MFCC values to the serial monitor. Finally, run the TinyML model on live audio and check if the recognized words match your speech.
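
For the microphone check, a minimal sketch like the one below can stream raw sample values to the Arduino serial plotter. It assumes the legacy ESP32 I2S driver has already been configured for the INMP441; the port, buffer size, and scaling are example choices.

    #include <Arduino.h>
    #include <driver/i2s.h>

    // Read one block of samples from an already-configured I2S port and print
    // them, so the waveform can be inspected in the serial plotter.
    void printMicSamples() {
      int32_t samples[256];
      size_t bytesRead = 0;
      i2s_read(I2S_NUM_0, samples, sizeof(samples), &bytesRead, portMAX_DELAY);
      for (size_t i = 0; i < bytesRead / sizeof(int32_t); ++i) {
        Serial.println(samples[i] >> 14);  // INMP441 data sits in the top bits; scale it down
      }
    }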

In early tests, work in a quiet environment to minimize false detections. Once the system works reliably indoors, test in noisier spaces and add more varied training data to improve performance.

Optimizations

Performance can be improved in several ways. Using a wake word, such as “car,” before issuing commands reduces accidental triggers. Adjusting the microphone gain helps balance sensitivity and noise resistance. Collecting extra data in the target environment allows the model to adapt to specific background sounds.
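
One simple way to implement the wake word is a short listening window: commands are only acted on for a few seconds after the wake word is heard. The sketch below assumes the model reports the wake word as its own label and that a gate function is called with every prediction; the label indices and window length are example values.

    #include <Arduino.h>

    // Hypothetical label indices; they must match the model's label order.
    enum Command { CMD_WAKE, CMD_GO, CMD_STOP, CMD_LEFT, CMD_RIGHT, CMD_REVERSE };

    const unsigned long kListenWindowMs = 3000;  // accept commands for 3 s after the wake word
    bool g_awake = false;
    unsigned long g_wakeTimeMs = 0;

    // Returns true if the recognized command should be acted on right now.
    bool gateCommand(int command) {
      unsigned long now = millis();
      if (command == CMD_WAKE) {
        g_awake = true;          // open (or re-open) the listening window
        g_wakeTimeMs = now;
        return false;            // the wake word itself does not move the car
      }
      if (g_awake && (now - g_wakeTimeMs) < kListenWindowMs) return true;
      g_awake = false;           // window expired; wait for the next wake word
      return false;
    }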

On the motor side, pulse-width modulation (PWM) can be added to control speed smoothly, and obstacle sensors can prevent collisions.
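
On the ESP32, the LEDC peripheral generates the PWM signal. With the classic Arduino-ESP32 API (core 2.x; newer cores use a slightly different set of calls), speed control for one motor channel might look like this, where the channel, frequency, and pin numbers are example values:

    #include <Arduino.h>

    const int kMotorAPwmPin  = 4;     // TB6612FNG PWMA input (example pin)
    const int kPwmChannel    = 0;     // LEDC channel
    const int kPwmFreqHz     = 20000; // above the audible range
    const int kPwmResolution = 8;     // duty range 0-255

    void setupMotorPwm() {
      ledcSetup(kPwmChannel, kPwmFreqHz, kPwmResolution);
      ledcAttachPin(kMotorAPwmPin, kPwmChannel);
    }

    // speed: 0 (stopped) to 255 (full speed)
    void setMotorSpeed(uint8_t speed) {
      ledcWrite(kPwmChannel, speed);
    }

The second motor channel (PWMB) would get its own LEDC channel set up the same way.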

Power Management

Since the ESP32 will be continuously listening, it consumes more power than a sleeping microcontroller. Strategies like light sleep between audio processing windows or an external wake-up trigger can help extend battery life. Choosing an efficient motor driver also reduces overall consumption.
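
A minimal sketch of the timer-based light-sleep idea is shown below; the interval is only an example, and audio is not captured while the chip sleeps, so the nap has to fit between listening windows.

    #include <esp_sleep.h>

    // Pause briefly between audio windows to save power. Light sleep keeps RAM
    // powered, so the program simply resumes where it left off.
    void napBetweenWindows() {
      esp_sleep_enable_timer_wakeup(200 * 1000);  // wake after 200 ms (argument is in microseconds)
      esp_light_sleep_start();
    }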

Conclusion

A TinyML-powered voice-controlled toy car is a rewarding project that combines embedded systems, machine learning, and robotics into one hands-on build. Beyond the fun factor, it demonstrates how AI can run entirely on small devices without relying on the cloud, enabling responsive, private, and portable solutions.

By mastering the process — from data collection to deployment — you not only create an interactive toy but also gain skills that can be applied to a wide range of edge AI applications. This is the essence of TinyML: bringing intelligence directly to the devices that interact with the physical world.