# parakeet.js

Live Demo | Keet | NPM Package

parakeet.js is browser speech-to-text for NVIDIA Parakeet ONNX models. It runs fully client-side using onnxruntime-web with WebGPU or WASM execution.
## Installation

```bash
npm i parakeet.js
# or
yarn add parakeet.js
```
## Quick start

```js
import { fromHub } from 'parakeet.js';

const model = await fromHub('parakeet-tdt-0.6b-v3', {
  backend: 'webgpu-hybrid',
  encoderQuant: 'fp32',
  decoderQuant: 'int8',
});

// `file` should be a File (for example from <input type="file">)
const pcm = await getMono16kPcm(file); // returns mono Float32Array at 16 kHz

const result = await model.transcribe(pcm, 16000, {
  returnTimestamps: true,
  returnConfidences: true,
});

console.log(result.utterance_text);
```
Use your existing audio pipeline to implement `getMono16kPcm(file)` (Web Audio API, ffmpeg, server-side decode, etc.). A complete browser example is available in `examples/demo/src/App.jsx` (the `transcribeFile` flow).
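For reference, the downmix half of such a pipeline can be sketched as below. The decode step is browser-only (`AudioContext`), so it appears in comments; `downmixToMono` is an illustrative helper name, not a parakeet.js export:

```javascript
// Sketch of the downmix step inside a typical getMono16kPcm(file).
// In the browser you would first decode:
//   const ctx = new AudioContext({ sampleRate: 16000 });
//   const audio = await ctx.decodeAudioData(await file.arrayBuffer());
//   const channels = [...Array(audio.numberOfChannels)]
//     .map((_, i) => audio.getChannelData(i));
// `downmixToMono` is an illustrative helper, not part of parakeet.js.
function downmixToMono(channels) {
  if (channels.length === 1) return Float32Array.from(channels[0]);
  const n = channels[0].length;
  const mono = new Float32Array(n);
  for (const ch of channels) {
    // Average all channels sample by sample.
    for (let i = 0; i < n; i++) mono[i] += ch[i] / channels.length;
  }
  return mono;
}
```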
## Loading models

- `fromHub(repoIdOrModelKey, options)`: easiest path. Accepts model keys like `parakeet-tdt-0.6b-v3` or full repo IDs.
- `fromUrls(cfg)`: explicit URL wiring when you host assets yourself.

```js
import { fromUrls } from 'parakeet.js';

const model = await fromUrls({
  encoderUrl: 'https://huggingface.co/ysdede/parakeet-tdt-0.6b-v3-onnx/resolve/main/encoder-model.onnx',
  decoderUrl: 'https://huggingface.co/ysdede/parakeet-tdt-0.6b-v3-onnx/resolve/main/decoder_joint-model.int8.onnx',
  tokenizerUrl: 'https://huggingface.co/ysdede/parakeet-tdt-0.6b-v3-onnx/resolve/main/vocab.txt',
  // Only needed if you choose preprocessorBackend: 'onnx'
  preprocessorUrl: 'https://huggingface.co/ysdede/parakeet-tdt-0.6b-v3-onnx/resolve/main/nemo128.onnx',
  backend: 'webgpu-hybrid',
  preprocessorBackend: 'js',
});
```
## Options

- `backend`: `wasm`, `webgpu-hybrid`, or `webgpu-strict` (`webgpu` is accepted as an alias). In `getParakeetModel`/`fromHub`, if `backend` starts with `webgpu` and `encoderQuant` is `int8`, encoder quantization is forced to `fp32`.
- `encoderQuant` / `decoderQuant`: `int8`, `fp32`, or `fp16` (FP16 loads artifacts such as `encoder-model.fp16.onnx`). `getParakeetModel`/`fromHub` are strict about the requested quantization: they do not auto-switch `fp16` to `fp32`. Set `decoderQuant: 'int8'` or `decoderQuant: 'fp32'` explicitly.
- `preprocessorBackend`: `js` (default) or `onnx`.

## JS preprocessor (real-FFT path)

parakeet.js now uses the pr74 real-FFT path in the default JS preprocessor (`preprocessorBackend: 'js'`).
This keeps feature compatibility with the previous implementation while reducing mel extraction cost.
| Item | Previous JS path | New JS path (default) |
|---|---|---|
| FFT strategy | Full N=512 complex FFT per frame | Real-FFT via one N/2=256 complex FFT + spectrum reconstruction (pr74) |
| Expected speed | Baseline | Faster mel stage (commonly around ~1.5x in local mel benchmarks) |
| Output behavior | NeMo-compatible normalized log-mel | Same behavior; ONNX-reference accuracy thresholds preserved |
| API changes | N/A | None (`JsPreprocessor` / `IncrementalMelProcessor` unchanged) |
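To make the real-FFT strategy concrete, here is an illustrative sketch (not parakeet.js internals) of computing the spectrum of a real length-N signal from one length-N/2 complex transform. A naive O(N²) DFT stands in for the FFT to keep the example short:

```javascript
// Naive complex DFT, used here in place of a real FFT kernel for clarity.
function dft(re, im) {
  const n = re.length;
  const outRe = new Array(n).fill(0), outIm = new Array(n).fill(0);
  for (let k = 0; k < n; k++) {
    for (let t = 0; t < n; t++) {
      const ang = (-2 * Math.PI * k * t) / n;
      outRe[k] += re[t] * Math.cos(ang) - im[t] * Math.sin(ang);
      outIm[k] += re[t] * Math.sin(ang) + im[t] * Math.cos(ang);
    }
  }
  return [outRe, outIm];
}

// Real DFT of x (length N, N even) via ONE length-N/2 complex transform:
// pack even samples as real parts, odd samples as imaginary parts, then
// reconstruct the spectrum from conjugate symmetry (the pr74 strategy).
function realDft(x) {
  const n = x.length, h = n / 2;
  const zr = new Array(h), zi = new Array(h);
  for (let k = 0; k < h; k++) { zr[k] = x[2 * k]; zi[k] = x[2 * k + 1]; }
  const [Zr, Zi] = dft(zr, zi);
  const Xr = new Array(h + 1), Xi = new Array(h + 1);
  for (let k = 0; k <= h; k++) {
    const kk = k % h, kc = (h - k) % h;
    // Even/odd sub-spectra recovered from the packed transform.
    const er = (Zr[kk] + Zr[kc]) / 2, ei = (Zi[kk] - Zi[kc]) / 2;
    const or_ = (Zi[kk] + Zi[kc]) / 2, oi = (Zr[kc] - Zr[kk]) / 2;
    // X[k] = E[k] + W^k * O[k], with twiddle W = exp(-2*pi*i/N).
    const ang = (-2 * Math.PI * k) / n;
    const tr = Math.cos(ang), ti = Math.sin(ang);
    Xr[k] = er + tr * or_ - ti * oi;
    Xi[k] = ei + tr * oi + ti * or_;
  }
  return [Xr, Xi]; // bins 0..N/2 (the rest follow by conjugate symmetry)
}
```

The power-spectrum bins 0..N/2 feed the mel filterbank, so the upper half of the spectrum is never materialized.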
If you need exact ONNX preprocessor execution instead of JS mel, set preprocessorBackend: 'onnx'.
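The backend/quantization interaction described earlier can be illustrated with a tiny helper (hypothetical, not a parakeet.js export): a WebGPU backend forces an `int8` encoder request to `fp32`, while other combinations pass through unchanged.

```javascript
// Hypothetical illustration of the documented getParakeetModel/fromHub rule:
// a backend starting with 'webgpu' plus encoderQuant 'int8' forces 'fp32'.
function resolveEncoderQuant(backend, encoderQuant) {
  if (backend.startsWith('webgpu') && encoderQuant === 'int8') return 'fp32';
  // fp16/fp32 requests are honored strictly, never auto-switched.
  return encoderQuant;
}
```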
## FP16

Before using the FP16 examples: ensure FP16 artifacts exist in the target repo and that your browser/runtime supports FP16 execution (the WebGPU FP16 path).

Load a known FP16 model key:
```js
import { fromHub } from 'parakeet.js';

const model = await fromHub('parakeet-tdt-0.6b-v3', {
  backend: 'webgpu-hybrid',
  encoderQuant: 'fp16',
  decoderQuant: 'fp16',
});
```
Use explicit FP16 URLs:
```js
import { fromUrls } from 'parakeet.js';

const model = await fromUrls({
  encoderUrl: 'https://huggingface.co/ysdede/parakeet-tdt-0.6b-v3-onnx/resolve/main/encoder-model.fp16.onnx',
  decoderUrl: 'https://huggingface.co/ysdede/parakeet-tdt-0.6b-v3-onnx/resolve/main/decoder_joint-model.fp16.onnx',
  tokenizerUrl: 'https://huggingface.co/ysdede/parakeet-tdt-0.6b-v3-onnx/resolve/main/vocab.txt',
  preprocessorBackend: 'js',
  backend: 'webgpu-hybrid',
});
```
## Demo flow

The demo flow in `examples/demo/src/App.jsx` is:

1. Load the model with `fromHub(...)` (hub loading) or `fromUrls(...)` (explicit URLs).
2. Decode audio with `AudioContext({ sampleRate: 16000 })` + `decodeAudioData(...)`.
3. Downmix to mono (`Float32Array`) by averaging channels when needed.
4. Call `model.transcribe(pcm, 16000, options)` and render `utterance_text`.

Reference code: the App component in `examples/demo/src/App.jsx` (`loadModel` / `transcribeFile` flow).

## Result shape

`model.transcribe(...)` returns a `TranscribeResult` with this shape:
```ts
type TranscribeResult = {
  utterance_text: string;
  words: Array<{
    text: string;
    start_time: number;
    end_time: number;
    confidence?: number;
  }>;
  tokens?: Array<{
    token: string;
    raw_token?: string;
    is_word_start?: boolean;
    start_time?: number;
    end_time?: number;
    confidence?: number;
  }>;
  confidence_scores?: {
    token?: number[] | null;
    token_avg?: number | null;
    word?: number[] | null;
    word_avg?: number | null;
    frame: number[] | null;
    frame_avg: number | null;
    overall_log_prob: number | null;
  };
  metrics?: {
    preprocess_ms: number;
    encode_ms: number;
    decode_ms: number;
    tokenize_ms: number;
    total_ms: number;
    rtf: number;
    mel_cache?: { cached_frames: number; new_frames: number };
  } | null;
  is_final: boolean;
  tokenIds?: number[];
  frameIndices?: number[];
  logProbs?: number[];
  tdtSteps?: number[];
};
```
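As an example of consuming this shape, a small formatter (hypothetical helper, shown against a mocked result) that renders word timings:

```javascript
// Hypothetical helper: render TranscribeResult words as "[start-end] text"
// lines. The `result` object below is mocked to match the documented shape.
function formatWords(result) {
  return result.words
    .map((w) => `[${w.start_time.toFixed(2)}-${w.end_time.toFixed(2)}] ${w.text}`)
    .join('\n');
}

const result = {
  utterance_text: 'hello world',
  words: [
    { text: 'hello', start_time: 0.0, end_time: 0.42 },
    { text: 'world', start_time: 0.48, end_time: 0.9 },
  ],
  is_final: true,
};
```

Remember that meaningful timings require `returnTimestamps: true` in the transcribe options.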
- Set `returnTimestamps` for meaningful `start_time`/`end_time`.
- Set `returnConfidences` for per-token/per-word confidence fields.
- `returnTokenIds`, `returnFrameIndices`, `returnLogProbs`, `returnTdtSteps` enable the corresponding optional output fields.

## Keet

Keet is a reference real-time app built on parakeet.js (repo).
- Streaming transcription via `createStreamingTranscriber(...)`.
- Utterance merging (`UtteranceBasedMerger`) with cursor/windowed chunk processing.
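A windowed merge of overlapping chunk transcripts might look like the following simplified, hypothetical sketch; this is not the `UtteranceBasedMerger` implementation, just an illustration of the idea of deduplicating the overlap between consecutive windows:

```javascript
// Simplified, hypothetical sketch of merging overlapping chunk transcripts;
// NOT the actual UtteranceBasedMerger. It drops the longest word-level
// overlap between the committed text and each new chunk.
function mergeChunks(chunks) {
  let merged = [];
  for (const chunk of chunks) {
    const words = chunk.split(/\s+/).filter(Boolean);
    let overlap = 0;
    for (let n = Math.min(merged.length, words.length); n > 0; n--) {
      if (merged.slice(-n).join(' ') === words.slice(0, n).join(' ')) {
        overlap = n;
        break;
      }
    }
    merged = merged.concat(words.slice(overlap));
  }
  return merged.join(' ');
}
```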
## API docs

```bash
npm run docs:api
```
## License

MIT