Live Demo | Keet | NPM Package
High-performance WebGPU speech recognition for NVIDIA Parakeet in the browser.
parakeet.js is browser speech-to-text for NVIDIA Parakeet ONNX models. It runs fully client-side using onnxruntime-web with WebGPU or WASM execution.
If you are looking for a JavaScript or browser runtime for NVIDIA Parakeet, parakeet.js is the package for that use case.
transcribeLongAudio(...)
npm i parakeet.js
# or
yarn add parakeet.js
import { fromHub } from 'parakeet.js';
const model = await fromHub('parakeet-tdt-0.6b-v3', {
backend: 'webgpu',
encoderQuant: 'fp32',
decoderQuant: 'int8',
});
// `file` should be a File (for example from <input type="file">)
const pcm = await getMono16kPcm(file); // returns mono Float32Array at 16 kHz
const result = await model.transcribe(pcm, 16000, {
returnTimestamps: true,
returnConfidences: true,
});
console.log(result.utterance_text);
Use your existing app audio pipeline for getMono16kPcm(file) (Web Audio API, ffmpeg, server-side decode, etc.). A complete browser example is available in examples/demo/src/App.jsx (transcribeFile flow).
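If you do not have an audio pipeline yet, the following is a minimal browser-side sketch of what getMono16kPcm(file) could look like. It assumes the browser honors the requested AudioContext sample rate (decodeAudioData then resamples to it) and that averaging channels is an acceptable downmix. mixToMono is a hypothetical helper name and not part of parakeet.js.

```javascript
// Average all channels into a single mono Float32Array.
// `channels` is an array of equal-length Float32Arrays (one per channel).
function mixToMono(channels) {
  if (channels.length === 1) return channels[0];
  const mono = new Float32Array(channels[0].length);
  for (const channel of channels) {
    for (let i = 0; i < mono.length; i++) mono[i] += channel[i];
  }
  for (let i = 0; i < mono.length; i++) mono[i] /= channels.length;
  return mono;
}

// Browser-only: decode a File into mono PCM at 16 kHz.
// Assumption: the AudioContext is created at 16 kHz, so the decoded
// AudioBuffer is already resampled to that rate.
async function getMono16kPcm(file) {
  const ctx = new AudioContext({ sampleRate: 16000 });
  try {
    const audioBuffer = await ctx.decodeAudioData(await file.arrayBuffer());
    const channels = [];
    for (let c = 0; c < audioBuffer.numberOfChannels; c++) {
      channels.push(audioBuffer.getChannelData(c));
    }
    return mixToMono(channels);
  } finally {
    await ctx.close();
  }
}
```

The downmix is kept separate from decoding so it can be reused when the PCM comes from another source, such as a server-side decode.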
- transcribe(audio, sampleRate, opts): best for short clips, direct uploads, chunk-by-chunk processing, or when your app already owns the chunking strategy.
- transcribeLongAudio(audio, sampleRate, opts): best for longer recordings where you want built-in windowing, chunk assembly, and timestamped merged output.
- createStreamingTranscriber(opts): best for contiguous real-time or near-real-time streaming flows.

- fromHub(repoIdOrModelKey, options): easiest path. Accepts model keys like parakeet-tdt-0.6b-v3 or full repo IDs.
- fromUrls(cfg): explicit URL wiring when you host assets yourself.

import { fromUrls } from 'parakeet.js';
const model = await fromUrls({
encoderUrl: 'https://huggingface.co/ysdede/parakeet-tdt-0.6b-v3-onnx/resolve/main/encoder-model.onnx',
decoderUrl: 'https://huggingface.co/ysdede/parakeet-tdt-0.6b-v3-onnx/resolve/main/decoder_joint-model.int8.onnx',
tokenizerUrl: 'https://huggingface.co/ysdede/parakeet-tdt-0.6b-v3-onnx/resolve/main/vocab.txt',
// Only needed if you choose preprocessorBackend: 'onnx'
preprocessorUrl: 'https://huggingface.co/ysdede/parakeet-tdt-0.6b-v3-onnx/resolve/main/nemo128.onnx',
backend: 'webgpu',
preprocessorBackend: 'js',
});
backend:
- webgpu
- webgpu-hybrid (same execution behavior as webgpu; accepted for compatibility)
- wasm

- webgpu means encoder on WebGPU and decoder on WASM.
- webgpu-hybrid is equivalent to webgpu in this library.
- wasm runs both encoder and decoder on WASM.
- In getParakeetModel/fromHub, if backend starts with webgpu and encoderQuant is int8, encoder quantization is forced to fp32.
- Supported quantization values are int8, fp32, and fp16.
- FP16 requires FP16 ONNX artifacts (for example encoder-model.fp16.onnx).
- getParakeetModel/fromHub are strict about requested quantization: they do not auto-switch fp16 to fp32.
- preprocessorBackend is js (default) or onnx.

transcribeLongAudio() is the long-form helper API for built-in sentence-aware chunking.
Its exported TypeScript types are LongAudioTranscribeOptions and LongAudioTranscribeResult.
Use it when you want built-in sentence-aware windowing and merged chunks for long recordings instead of manually splitting audio in application code.
Before using FP16 examples: ensure FP16 artifacts exist in the target repo and your browser/runtime supports FP16 execution (WebGPU FP16 path).
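Because the loaders do not auto-switch fp16 to fp32, an application may want to check for FP16 support before requesting it. The sketch below uses the standard WebGPU 'shader-f16' adapter feature as a rough proxy. Whether that feature is sufficient for the ONNX Runtime FP16 path is an assumption here, and both helper names are hypothetical (not part of parakeet.js).

```javascript
// Standard WebGPU: an adapter advertises 16-bit shader support via 'shader-f16'.
function adapterSupportsFp16(adapter) {
  return Boolean(adapter && adapter.features && adapter.features.has('shader-f16'));
}

// Hypothetical helper: pick the encoderQuant value to request for a webgpu backend.
// Returns 'fp32' when WebGPU is unavailable or the adapter lacks 'shader-f16'.
async function pickWebGpuEncoderQuant(gpu = globalThis.navigator?.gpu) {
  if (!gpu) return 'fp32';
  const adapter = await gpu.requestAdapter();
  return adapterSupportsFp16(adapter) ? 'fp16' : 'fp32';
}
```

The returned value can be passed as encoderQuant when calling fromHub(...). FP16 artifacts must still exist in the target repo.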
Load known FP16 model key:
import { fromHub } from 'parakeet.js';
const model = await fromHub('parakeet-tdt-0.6b-v3', {
backend: 'webgpu',
encoderQuant: 'fp16',
decoderQuant: 'int8',
});
Use explicit FP16 URLs:
import { fromUrls } from 'parakeet.js';
const model = await fromUrls({
encoderUrl: 'https://huggingface.co/ysdede/parakeet-tdt-0.6b-v3-onnx/resolve/main/encoder-model.fp16.onnx',
decoderUrl: 'https://huggingface.co/ysdede/parakeet-tdt-0.6b-v3-onnx/resolve/main/decoder_joint-model.fp16.onnx',
tokenizerUrl: 'https://huggingface.co/ysdede/parakeet-tdt-0.6b-v3-onnx/resolve/main/vocab.txt',
preprocessorBackend: 'js',
backend: 'webgpu',
});
The demo flow in examples/demo/src/App.jsx is:
1. Load a model (fromHub(...) for hub loading, or fromUrls(...) for explicit URLs).
2. Decode the audio with AudioContext({ sampleRate: 16000 }) + decodeAudioData(...).
3. Convert to mono (Float32Array) by averaging channels when needed.
4. Call model.transcribe(pcm, 16000, options) and render utterance_text.

Reference code:
- App component in examples/demo/src/App.jsx (loadModel / transcribeFile flow)

transcribe() options and result behavior
returnTimestamps is off by default.
So: by default, transcribe(...) does not return meaningful timestamps.
transcribe(audio, sampleRate, opts) options

| Option | Default | Effect |
|---|---|---|
| returnTimestamps | false | Adds start_time / end_time to tokens[] and words[]. |
| returnConfidences | false | Adds per-token/per-word confidence fields and detailed confidence_scores. |
| temperature | 1.0 | Decoder temperature (1.0 = greedy baseline behavior). |
| debug | false | Enables debug logs; also causes metrics to be populated. |
| enableProfiling | true | When true, returns timing/RTF in metrics. |
| skipCMVN | false | Skips CMVN in preprocessing. |
| frameStride | 1 | Decoder frame advance stride. |
| previousDecoderState | null | Continue decoding from an earlier chunk (streaming/stateful usage). |
| returnDecoderState | false | Includes decoderState in the result for next-call handoff. |
| timeOffset | 0 | Offset (seconds) added to emitted timestamps. |
| returnTokenIds | false | Includes tokenIds in result. |
| returnFrameIndices | false | Includes frameIndices (token-to-encoder-frame alignment). |
| returnLogProbs | false | Includes per-token logProbs. |
| returnTdtSteps | false | Includes per-token tdtSteps (duration predictor outputs). |
| prefixSamples | 0 | Enables incremental mel-cache reuse when prefix audio matches previous call. |
| precomputedFeatures | null | Bypasses preprocessor by supplying already-computed mel features. |
| incremental | null | Incremental decode cache config: { cacheKey, prefixSeconds }. |
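As an illustration of how previousDecoderState, returnDecoderState, and timeOffset might be combined for chunk-by-chunk processing, here is a minimal sketch. It assumes model exposes transcribe(audio, sampleRate, opts) with the options listed above and that chunks are contiguous, non-overlapping Float32Array segments. transcribeChunks is a hypothetical helper, and the naive text join does not handle words cut at chunk boundaries.

```javascript
// Transcribe contiguous chunks in order, handing decoder state from one
// call to the next and shifting timestamps by each chunk's start time.
async function transcribeChunks(model, chunks, sampleRate) {
  let decoderState = null; // nothing to continue from on the first chunk
  let samplesSoFar = 0;
  const texts = [];
  const words = [];
  for (const chunk of chunks) {
    const result = await model.transcribe(chunk, sampleRate, {
      returnTimestamps: true,
      returnDecoderState: true,
      previousDecoderState: decoderState,
      timeOffset: samplesSoFar / sampleRate, // seconds from the start of the source
    });
    decoderState = result.decoderState ?? null;
    samplesSoFar += chunk.length;
    if (result.utterance_text) texts.push(result.utterance_text);
    words.push(...(result.words ?? []));
  }
  return { text: texts.join(' '), words };
}
```

For long recordings where you do not need to own the chunking, transcribeLongAudio(...) is likely the simpler choice.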
type TranscribeResult = {
utterance_text: string;
words: Array<{
text: string;
start_time: number;
end_time: number;
confidence?: number;
}>;
tokens?: Array<{
token: string;
raw_token?: string;
is_word_start?: boolean;
start_time?: number;
end_time?: number;
confidence?: number;
}>;
confidence_scores?: {
token?: number[] | null;
token_avg?: number | null;
word?: number[] | null;
word_avg?: number | null;
frame: number[] | null;
frame_avg: number | null;
overall_log_prob: number | null;
};
metrics?: {
preprocess_ms: number;
encode_ms: number;
decode_ms: number;
tokenize_ms: number;
total_ms: number;
rtf: number;
mel_cache?: { cached_frames: number; new_frames: number } | null;
preprocessor_backend?: 'js' | 'onnx' | string; // runtime field
} | null;
is_final: boolean;
decoderState?: {
s1: Float32Array;
s2: Float32Array;
dims1: number[];
dims2: number[];
};
tokenIds?: number[];
frameIndices?: number[];
logProbs?: number[];
tdtSteps?: number[];
};
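Since several result fields are optional or nullable, consumer code usually needs a few guards. The sketch below formats a result into display lines. describeResult and the 0.5 confidence threshold are illustrative assumptions, not library API or recommended values.

```javascript
// Turn a TranscribeResult-shaped object into human-readable lines.
// Guards against missing words, missing confidence, and null metrics.
function describeResult(result, minConfidence = 0.5) {
  const lines = [result.utterance_text];
  for (const w of result.words ?? []) {
    const low = typeof w.confidence === 'number' && w.confidence < minConfidence;
    const span = `${w.start_time.toFixed(2)}-${w.end_time.toFixed(2)}`;
    lines.push(`${span} ${w.text}${low ? ' (low confidence)' : ''}`);
  }
  if (result.metrics) {
    lines.push(`rtf=${result.metrics.rtf.toFixed(2)}`);
  }
  return lines;
}
```

Word timestamps are only meaningful when the call was made with returnTimestamps: true, so a real UI would likely hide the time span otherwise.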
| Call options | words | tokens | confidence_scores | metrics |
|---|---|---|---|---|
| default ({}) | [] (empty) | omitted | omitted | present (enableProfiling default is true) |
| { returnTimestamps: true } | timestamped words | timestamped tokens | minimal (frame/frame_avg/overall_log_prob are null) | present by default |
| { returnConfidences: true } | words with confidence | tokens with confidence | detailed token/word/frame confidence stats | present by default |
| { returnTimestamps: true, returnConfidences: true } | timestamped + confidence | timestamped + confidence | detailed token/word/frame confidence stats | present by default |
Notes:
- start_time / end_time are only meaningful when returnTimestamps: true.
- Advanced outputs are opt-in: returnTokenIds, returnFrameIndices, returnLogProbs, returnTdtSteps.
- If enableProfiling: false and debug: false, then metrics is null.
- Invalid timeOffset values passed to transcribe(...) are coerced to 0 with a warning for compatibility.
- Invalid values passed to transcribe(...) or computeFeatures(...) are sanitized to 0 with a warning for compatibility.

transcribeLongAudio() behavior
Use it when you want built-in sentence-aware windowing and chunk assembly for long recordings such as meetings, podcasts, call recordings, interviews, or lectures.
Internally, long-form transcription does not just emit fixed overlapping windows. It transcribes windows, detects sentence boundaries from timestamped words, finalizes completed segments, and advances from the last stable boundary when possible.
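To make the idea of a "last stable boundary" concrete, here is a simplified sketch of detecting a sentence boundary from timestamped words. It is not the library's actual algorithm. It assumes word text carries end-of-sentence punctuation, and both function names are hypothetical.

```javascript
// Return the end_time of the last word that closes a sentence, or null.
// A word closes a sentence if it ends in . ! or ? (optionally followed
// by closing quotes or brackets).
function lastSentenceBoundary(words) {
  for (let i = words.length - 1; i >= 0; i--) {
    if (/[.!?]["')\]]*$/.test(words[i].text)) return words[i].end_time;
  }
  return null;
}

// Split a window's words into finalized words (up to the boundary) and
// pending words that the next window would need to cover again.
function splitAtStableBoundary(words) {
  const boundary = lastSentenceBoundary(words);
  if (boundary === null) return { finalized: [], pending: words, boundary };
  return {
    finalized: words.filter((w) => w.end_time <= boundary),
    pending: words.filter((w) => w.end_time > boundary),
    boundary,
  };
}
```

In this simplified model, the next window would start at boundary rather than at a fixed overlap, so a sentence is never finalized while it is cut in half.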
const result = await model.transcribeLongAudio(pcm, 16000, {
returnTimestamps: true,
chunkLengthS: 95,
});
console.log(result.text);
console.log(result.chunks);
Use timeOffset only when this audio starts later inside a larger source:
const result = await model.transcribeLongAudio(pcmSlice, 16000, {
returnTimestamps: true,
timeOffset: 12.5,
});
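One way to derive pcmSlice and a matching timeOffset from a larger buffer is sketched below. sliceForLongAudio is a hypothetical helper, and the source audio is assumed to be a mono Float32Array at the given sample rate.

```javascript
// Cut [startSeconds, endSeconds) out of a larger PCM buffer and report
// the offset (in seconds) at which the slice actually begins.
function sliceForLongAudio(pcm, sampleRate, startSeconds, endSeconds) {
  const start = Math.max(0, Math.round(startSeconds * sampleRate));
  const end = Math.min(pcm.length, Math.round(endSeconds * sampleRate));
  return {
    pcmSlice: pcm.subarray(start, end), // a view on the same memory, no copy
    timeOffset: start / sampleRate,
  };
}
```

Computing timeOffset from the rounded sample index, rather than from startSeconds directly, keeps returned timestamps aligned with the samples that were actually passed in.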
Word-level chunk output:
const result = await model.transcribeLongAudio(pcm, 16000, {
returnTimestamps: 'word',
});
console.log(result.words);
console.log(result.chunks);
transcribeLongAudio(audio, sampleRate, opts) options

| Option | Default | Effect |
|---|---|---|
| returnTimestamps | false | true returns sentence-like chunks; 'word' returns per-word chunks. |
| chunkLengthS | 0 | Fixed window length in seconds. 0 enables automatic window sizing for long inputs. |
| timeOffset | 0 | Optional base offset (seconds) added to returned chunk/word timestamps. |
| other transcribe() options | varies | Forwarded to each internal transcription window. |
type LongAudioTranscribeResult = {
text: string;
words?: Array<{
text: string;
start_time: number;
end_time: number;
confidence?: number;
}>;
chunks?: Array<{
text: string;
timestamp: [number, number];
}>;
};
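A common use of chunks is subtitle export. The following sketch converts them to SRT text, assuming each timestamp is a [start, end] pair in seconds. toSrtTime and chunksToSrt are hypothetical helpers, not part of parakeet.js.

```javascript
// Format seconds as an SRT timestamp: HH:MM:SS,mmm
function toSrtTime(seconds) {
  const totalMs = Math.max(0, Math.round(seconds * 1000));
  const ms = totalMs % 1000;
  const s = Math.floor(totalMs / 1000) % 60;
  const m = Math.floor(totalMs / 60000) % 60;
  const h = Math.floor(totalMs / 3600000);
  const pad = (n, width = 2) => String(n).padStart(width, '0');
  return `${pad(h)}:${pad(m)}:${pad(s)},${pad(ms, 3)}`;
}

// Build SRT text: numbered cues separated by blank lines.
function chunksToSrt(chunks) {
  return chunks
    .map((c, i) => {
      const [start, end] = c.timestamp;
      return `${i + 1}\n${toSrtTime(start)} --> ${toSrtTime(end)}\n${c.text.trim()}\n`;
    })
    .join('\n');
}
```

Calling chunksToSrt(result.chunks ?? []) on a result obtained with returnTimestamps: true yields one cue per sentence-like chunk.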
Notes:
- The exported TypeScript types are LongAudioTranscribeOptions and LongAudioTranscribeResult.
- returnTimestamps: true returns merged sentence-like chunks.
- returnTimestamps: 'word' returns per-word chunks while still including merged words.
- For short inputs, transcribeLongAudio(...) falls back to a single internal transcribe(...) call.
- returnConfidences, debug, enableProfiling, temperature, and skipCMVN are forwarded to the internal transcription windows.

Keet is a reference real-time app built on parakeet.js (repo).
createStreamingTranscriber(...).UtteranceBasedMerger) with cursor/windowed chunk processing.
npm run docs:api
MIT