AI-Powered Cognitive Wearable for Pets

A Talking Collar That Brings You Closer to Your Pet Than Ever Before.

DRL Team
AI R&D Center
27 May 2025
17 min read

Summary

  • Imagine your pet talking with you 24/7, just like the dog from the Pixar movie Up. Your pet becomes able to express its emotions, fears, and needs, and can even remind you to visit a vet clinic.

  • The device contains speakers and sensors for pet-human conversation, covering not only dialogue functionality but also sound and action recognition, the caregiver's voice, pet-name recognition, and the ability to send messages and reminders. In addition, the caregiver can choose the pet's "voice" on the collar from dozens of unique voice actors.

  • Our team developed an intelligent, real-time communication system embedded in a wearable, featuring cloud-based ML for dialogue, sound, and activity detection, personalized caregiver-voice responses, and a robust event-driven alerts pipeline.

Tech Stack

Python
AWS cloud stack
Starlette websockets
FastConformer
FastFit
Gemini API
ElevenLabs API
Twilio
Voice Activity Detection
Speaker diarization
Keyword spotting
Audio classification
IMU-based activity recognition
Triton Inference Server
ONNX
TensorRT
Postgres
Redis
Prometheus
MongoDB

Tech Challenge

  • Low-latency dialogue flow. Processing of the caregiver's speech, the dialogue turns, and response selection should occur with minimal latency, resembling real human-to-human communication.

  • Realistic, natural, and emotional voice. The TTS model should produce a highly expressive voice that matches the original voice actor chosen by the user.

  • Few-shot keyword spotting. To save battery life, the system should be activated only after the keyword, the pet's name. However, only a few samples of the user's voice, recorded during onboarding, are available.

  • High-quality speech recognition in noisy environments. Most of the use cases for collars are outdoor walks with caregivers. Moreover, a collar on a dog, even in a quiet room, is a scenario with additional noise as the dog breathes, moves, whines, or barks.

  • Sound recognition. Developing a robust pet sound recognition system required overcoming the lack of open-source audio datasets and tuning models to capture high-frequency patterns often present in dog vocalizations. This involved custom data collection and complex feature extraction.

  • Emotion recognition. The system includes emotion recognition for both pets and caregivers. It feeds the audio signal and the text context into a custom-designed and trained neural network that produces emotion labels to supplement the pet-human interaction context.

  • IMU-based activity detection and analysis. Predicted activities such as resting, running, or jumping should trigger relevant events that drive the pet's well-being analytics. This requires training a self-hosted IMU-based classification model, supported by an ETL pipeline for sensor data collection and preprocessing.

  • Data Collection and Labelling. One of the key challenges was data gathering, which included specific pet-related sounds and IMU sensor readings.

  • Real-time ML model inference at scale. Models should run fast enough to serve predictions instantly (typically in milliseconds) while handling thousands of requests per second. This requires efficient use of computing resources (such as GPUs) and the design of scalable, fault-tolerant services.

  • Event-driven alert system. Based on short-term and long-term events and predictions from ML models, the collar should inform the caregiver about nutrition, dehydration, missed activity, sleep, or potential danger to the pet.

Solution

  • For specific interaction scenarios, we used FastFit, IBM's update of the SetFit architecture, which was developed to address semantic collapse for multi-class classification in embedding space. The trained model delivers 20 ms latency on average.
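
A minimal sketch of the embedding-space classification idea behind this choice, using sentence-transformers as a generic stand-in encoder rather than FastFit itself; the model name and scenario labels are illustrative:

```python
import numpy as np
from sentence_transformers import SentenceTransformer  # generic stand-in encoder, not FastFit

# Illustrative interaction scenarios with a handful of labelled examples each.
FEW_SHOT_EXAMPLES = {
    "feed_reminder": ["I'm hungry", "is it dinner time?", "can I get a snack"],
    "walk_request":  ["let's go outside", "I want a walk", "take me to the park"],
    "greeting":      ["hi buddy", "hello there", "good morning"],
}

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed model name

# One centroid per scenario, built from the few labelled examples.
centroids = {
    label: encoder.encode(texts, normalize_embeddings=True).mean(axis=0)
    for label, texts in FEW_SHOT_EXAMPLES.items()
}

def classify_scenario(utterance: str) -> str:
    """Return the scenario whose centroid is most similar to the utterance embedding."""
    vec = encoder.encode([utterance], normalize_embeddings=True)[0]
    return max(centroids, key=lambda label: float(np.dot(vec, centroids[label])))

print(classify_scenario("any chance of a treat?"))  # e.g. "feed_reminder"
```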

  • For the generic scenario, we used the Gemini API with streaming to generate responses. Our prompt engineer crafted a golden pet persona so that the answers match the character selected during onboarding.
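
A hedged sketch of streaming generation with the google-generativeai client; the model name, the persona text, and the feed_to_tts helper are illustrative assumptions, not the production prompt or pipeline:

```python
import os
import google.generativeai as genai

genai.configure(api_key=os.environ["GEMINI_API_KEY"])

# Illustrative persona prompt; the real prompt encodes the character chosen at onboarding.
PERSONA = "You are Rex, a cheerful golden retriever. Answer in short, warm, playful sentences."

model = genai.GenerativeModel("gemini-1.5-flash", system_instruction=PERSONA)  # model name assumed
chat = model.start_chat()

def feed_to_tts(text_fragment: str) -> None:
    """Hypothetical hook that forwards partial text to the TTS stage as soon as it arrives."""
    print(text_fragment, end="", flush=True)

for chunk in chat.send_message("Who's a good boy?", stream=True):
    # Streaming lets speech synthesis start before the full reply has been generated.
    feed_to_tts(chunk.text)
```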

  • Voice cloning was implemented using the ElevenLabs API. Our team ran several evaluation experiments to optimize cloning hyperparameters and achieve natural, expressive audio samples for each voice actor. For synthesizing the responses themselves, we used the flash-v2.5 TTS model with streaming to reduce latency.
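
A minimal sketch of streaming synthesis against the ElevenLabs REST API, assuming the /v1/text-to-speech/{voice_id}/stream endpoint and the Flash v2.5 model id; the voice settings are illustrative values, not tuned production parameters:

```python
import requests

ELEVENLABS_URL = "https://api.elevenlabs.io/v1/text-to-speech/{voice_id}/stream"

def stream_tts(text: str, voice_id: str, api_key: str):
    """Yield audio chunks for `text` in the cloned voice as they arrive."""
    resp = requests.post(
        ELEVENLABS_URL.format(voice_id=voice_id),
        headers={"xi-api-key": api_key},
        json={
            "text": text,
            "model_id": "eleven_flash_v2_5",  # low-latency Flash v2.5 model (id assumed)
            "voice_settings": {"stability": 0.4, "similarity_boost": 0.8},  # illustrative
        },
        stream=True,
    )
    resp.raise_for_status()
    for chunk in resp.iter_content(chunk_size=4096):
        if chunk:
            yield chunk  # raw audio bytes, ready to forward to the collar's speaker
```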

  • To handle the keyword-spotting challenge, we adapted the open-source PLiX model, which aligns audio and text representations via contrastive learning. This enables the recognition of new keywords from just a few audio samples, without retraining.
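
We do not reproduce PLiX here, but the enrollment-and-matching logic around such an encoder can be sketched as follows; embed_audio is a hypothetical wrapper around the audio branch, and the threshold is an illustrative value tuned on validation data:

```python
import numpy as np

def embed_audio(waveform: np.ndarray) -> np.ndarray:
    """Hypothetical wrapper around the keyword-spotting encoder (e.g. the PLiX audio branch);
    returns an L2-normalized embedding for a short audio window."""
    raise NotImplementedError

class KeywordSpotter:
    """Enroll a pet name from a few onboarding clips, then match incoming windows against it."""

    def __init__(self, threshold: float = 0.75):  # illustrative threshold
        self.threshold = threshold
        self.keyword_embedding: np.ndarray | None = None

    def enroll(self, onboarding_clips: list[np.ndarray]) -> None:
        # Average the embeddings of the few enrollment samples: no retraining needed.
        embs = np.stack([embed_audio(clip) for clip in onboarding_clips])
        centroid = embs.mean(axis=0)
        self.keyword_embedding = centroid / np.linalg.norm(centroid)

    def is_keyword(self, window: np.ndarray) -> bool:
        score = float(np.dot(embed_audio(window), self.keyword_embedding))
        return score >= self.threshold
```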

  • As for speech recognition, we needed an accurate voice activity detection model to turn off some system components when idle, thereby conserving battery life. The Trillson2 model was used as a VAD. It was trained on a custom dataset, achieving 98% accuracy on validation.
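
A simplified sketch of how VAD decisions gate the heavier speech components; the vad and pipeline objects are hypothetical wrappers around the deployed models:

```python
def audio_loop(frames, vad, pipeline):
    """Run keyword spotting / ASR only while the VAD reports speech, keeping the rest idle."""
    in_speech = False
    for frame in frames:              # e.g. short PCM frames from the collar microphone
        if vad.is_speech(frame):      # hypothetical call into the VAD model
            in_speech = True
            pipeline.feed(frame)      # wake the downstream components only when needed
        elif in_speech:
            pipeline.flush()          # utterance ended: finalize recognition
            in_speech = False
        # Silence frames outside an utterance are dropped, saving compute and battery.
```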

  • As a speech-to-text model, we use self-hosted NVIDIA FastConformer-CTC XLarge with streaming support. According to the results of our experiments, this model outperformed the well-known proprietary STT solutions currently on the market.
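
For reference, offline transcription with a FastConformer-CTC checkpoint via NVIDIA NeMo looks roughly like this; the checkpoint identifier is an assumption, and the streaming deployment (NeMo's cache-aware setup) is not shown:

```python
import nemo.collections.asr as nemo_asr

# Checkpoint name assumed; pick the FastConformer-CTC variant available in your NeMo/NGC setup.
asr_model = nemo_asr.models.ASRModel.from_pretrained(
    model_name="stt_en_fastconformer_ctc_xlarge"
)

transcripts = asr_model.transcribe(["walk_in_the_park.wav"])  # offline, non-streaming usage
print(transcripts[0])
```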

  • To solve the data shortage problem, we set up a pipeline for collecting and labelling IMU sensor data and sound data by assembling a separate team.

  • For environmental sound classification, we trained the BEATs model on the custom dataset, achieving 90% accuracy on validation.

  • For emotion recognition, we use the emotion2vec+ model (audio-based classification) and spaCy NER (extracting emotion entities such as anger, laughter, excitement, and sadness from text utterances).
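
The audio branch (emotion2vec+) is not shown, but the text side can be approximated with a spaCy entity ruler; the production pipeline may use a trained NER component instead, and the patterns below are a small illustrative subset:

```python
import spacy

nlp = spacy.load("en_core_web_sm")
ruler = nlp.add_pipe("entity_ruler", before="ner")
ruler.add_patterns([
    {"label": "EMOTION", "pattern": [{"LOWER": word}]}
    for word in ["anger", "angry", "laughter", "laughed", "excited",
                 "excitement", "sad", "sadness", "scared", "happy"]
])

doc = nlp("She laughed and sounded really excited when the dog came back.")
print([(ent.text, ent.label_) for ent in doc.ents if ent.label_ == "EMOTION"])
# e.g. [('laughed', 'EMOTION'), ('excited', 'EMOTION')]
```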

  • Based on the dataset collected through our pipeline, we trained an effective RNN-based model for classifying IMU sensor-based activities, successfully recognizing different movement patterns.
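
A minimal PyTorch sketch of such an RNN classifier over IMU windows; the layer sizes, window length, and activity set are illustrative, not the production configuration:

```python
import torch
import torch.nn as nn

class IMUActivityClassifier(nn.Module):
    """GRU over windows of 6-axis IMU readings (accelerometer + gyroscope)."""

    def __init__(self, num_channels: int = 6, hidden_size: int = 64,
                 num_classes: int = 4):  # e.g. resting / walking / running / jumping
        super().__init__()
        self.gru = nn.GRU(num_channels, hidden_size, num_layers=2,
                          batch_first=True, dropout=0.2)
        self.head = nn.Linear(hidden_size, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time_steps, channels), e.g. 2-second windows at 50 Hz -> 100 steps
        _, h_n = self.gru(x)
        return self.head(h_n[-1])      # logits over activity classes

model = IMUActivityClassifier()
window = torch.randn(8, 100, 6)        # a batch of 8 sensor windows
logits = model(window)                 # shape: (8, 4)
```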

  • To monitor performance and ensure data quality, key metrics were visualized in real-time using Prometheus and Grafana.
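
A short sketch of exposing such metrics with prometheus_client; the metric names, labels, buckets, and the stand-in transcribe call are illustrative:

```python
from prometheus_client import Counter, Histogram, start_http_server

INFERENCE_LATENCY = Histogram(
    "collar_inference_latency_seconds",
    "End-to-end model inference latency",
    ["model"],
    buckets=(0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0),
)
INFERENCE_ERRORS = Counter(
    "collar_inference_errors_total", "Failed inference requests", ["model"]
)

def transcribe(audio_chunk: bytes) -> str:
    """Stand-in for the real STT call, here only to keep the example self-contained."""
    return "hello"

def run_stt(audio_chunk: bytes) -> str:
    with INFERENCE_LATENCY.labels(model="stt").time():
        try:
            return transcribe(audio_chunk)
        except Exception:
            INFERENCE_ERRORS.labels(model="stt").inc()
            raise

if __name__ == "__main__":
    start_http_server(9100)  # exposes /metrics; Prometheus scrapes it, Grafana plots it
    run_stt(b"\x00\x01")
```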

  • Our team used the TensorRT SDK to optimize and accelerate the trained models for sound, emotion, and IMU classification, ensuring low-latency execution on GPU hardware. These models, as well as the STT model, are then deployed via Triton Inference Server, which provides dynamic batching and concurrent model execution, ensuring scalable and efficient serving.
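
A rough sketch of calling one of the Triton-served models from a Python service with tritonclient; the model name, tensor names, and shapes are assumptions carried over from the IMU example above:

```python
import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

# A single IMU window; tensor names and shapes must match the model's Triton config.
window = np.random.randn(1, 100, 6).astype(np.float32)
infer_input = httpclient.InferInput("INPUT__0", list(window.shape), "FP32")
infer_input.set_data_from_numpy(window)

result = client.infer(
    model_name="imu_activity",  # assumed name in the model repository
    inputs=[infer_input],
    outputs=[httpclient.InferRequestedOutput("OUTPUT__0")],
)
logits = result.as_numpy("OUTPUT__0")  # dynamic batching happens server-side
print(logits.argmax(axis=-1))
```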

  • To manage and orchestrate real-time data flow, we utilized Redis streams as a lightweight message broker, making it easy to send, receive, and track inference requests and results across services. This stack allows us to achieve a robust, low-latency inference pipeline capable of handling high loads in production environments.
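
A minimal sketch of the Redis Streams producer/consumer pattern with redis-py; the stream, group, and handler names are illustrative:

```python
import json
import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

STREAM, GROUP = "inference.requests", "sound-workers"  # names are illustrative

# Producer side: the gateway enqueues an inference request onto the stream.
r.xadd(STREAM, {"request_id": "req-1", "payload": json.dumps({"kind": "sound"})})

# Consumer side: workers in a consumer group read, process, and acknowledge messages.
try:
    r.xgroup_create(STREAM, GROUP, id="0", mkstream=True)
except redis.ResponseError:
    pass  # group already exists

def handle(fields: dict) -> None:
    """Stand-in for dispatching the request to the right model service."""
    print("processing", fields["request_id"])

for _, entries in r.xreadgroup(GROUP, "worker-1", {STREAM: ">"}, count=10, block=2000):
    for msg_id, fields in entries:
        handle(fields)
        r.xack(STREAM, GROUP, msg_id)  # acknowledged messages leave the pending list
```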

Impact

  • The developed system allows the collar to recognize the pet's name, interpret behaviors, and respond intelligently through natural voice. This laid the foundation for real-time communication between pets and humans.

  • Thanks to our model-inference optimizations, the collar delivers seamless, real-time interaction. During load testing, our system successfully handled over 10k concurrent users with stable latency and high RPS.

  • The product attracted hundreds of users at the early-stage launch, demonstrating strong market interest and the value of our solution.

  • The alert system, with its personalized voices, reminders, and notifications, lets users stay connected with their pets anywhere. Beyond the product itself, our work sets a new standard for smart collars and encourages further innovation in pet care technology.
