Code on GitHub - https://github.com/rakshit2020/Live-Streaming-Data-RAG
Rakshit Aralimatti
RakshitAralimatti
AI & ML interests: Nvidia
Recent Activity
commented on their article, 22 days ago
I Built a RAG System That Listens to Live BBC News and Answers Questions About "What Happened 10 Minutes Ago"
replied to their post, 22 days ago
reacted to their post with 🔥, 22 days ago
replied to their post, 22 days ago
posted an update, 22 days ago
I built something crazy you've never seen before.
Please check - https://huggingface.co/blog/RakshitAralimatti/streaming-data-rag
A real-time Streaming Data to RAG system that listens to live radio, transcribes it on-the-fly, and lets you query across TIME.
Not just "what was discussed", but "what happened in the last 10 minutes on channel 0?" or "at 9 AM, what was the breaking news?" This is RAG that understands temporal context.
reacted to ovi054's post with 🔥, about 1 month ago
Introducing Anim Lab AI ⚡
My submission for the MCP 1st Birthday Hackathon
Turn any math concept or logic into a clear video explanation instantly using AI.
Try it now: MCP-1st-Birthday/anim-lab-ai
Demo outputs are attached.
replied to their post, about 2 months ago
Modern OCR in healthcare is extremely reliable when implemented correctly. I've personally built OCR + RAG systems for healthcare clients, and the results have been remarkable.
posted an update, about 2 months ago
OCR has absolutely blown up in 2025, and honestly, my perspective on document processing has completely changed.
This year has been wild. Vision Language Models like Nanonets OCR2-3B hit the scene, and suddenly we're getting far better accuracy on complex forms than traditional OCR ever managed. We're talking handwritten checkboxes, watermarked documents, multi-column layouts, even LaTeX equations, all handled in a single pass.
The market numbers say it all: OCR accuracy passed 98% for printed text, AI integration is everywhere, and real-time processing is now standard. The entire OCR market is hitting $25.13 billion in 2025 because this tech actually works now.
I wrote a detailed Medium article walking through:
1. Why vision LMs changed the game
2. NVIDIA NeMo Retriever architecture
3. Complete code breakdown
4. Real government/healthcare use cases
5. Production deployment guide
Article: https://medium.com/@rakshitaralimatti2001/nvidia-nemo-retriever-ocr-building-document-intelligence-systems-for-enterprise-and-government-42a6684c37a1
Try It Yourself
reacted to prithivMLmods's post with 🔥, 4 months ago
I'm a Hugging Face Fellow now, guys! 🤗❤️
With the same passion, trust, and momentum to contribute to the community, I'm excited to do some amazing things to wrap up Q3 and Q4 of 2025. And importantly, I've been lucky enough to receive some knowledge and guidance from @merve to build open-source demos and stuff. Thank you for the belief.
Thank you - much love.
Long live open source!
- Prithiv
replied to andywu-kby's post, 4 months ago
I tried it, it's very COOL.
posted an update, 4 months ago
Have you ever wanted to easily deploy a cutting-edge speech recognition system that actually works in real time? How about one powered by NVIDIA GPUs on Kubernetes, but without the headache of complicated installs?
Well, your wait is over! My latest blog shows how to deploy NVIDIA Riva ASR in just 5 minutes using Helm charts. From validating GPU readiness in Kubernetes to customizing your ASR models and spinning up the service, this guide covers it all.
Read it here - https://medium.com/@rakshitaralimatti2001/deploy-nvidia-riva-asr-on-kubernetes-gpu-ready-in-minutes-30955d6ed7b8
BONUS: I even built simple Streamlit apps so you can test with your mic or upload audio files to see the magic live.
Bookmark this post and the blog for your next voice AI project or production-ready speech application!
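As a rough companion sketch (not taken from the blog), querying a deployed Riva server from Python with the nvidia-riva-client package looks roughly like this; the server address, audio file name, and recognition settings are placeholders you would adapt to your own Helm deployment.

```python
# pip install nvidia-riva-client
import riva.client

# Connect to the Riva server exposed by the Helm release (address is a placeholder).
auth = riva.client.Auth(uri="localhost:50051")
asr = riva.client.ASRService(auth)

config = riva.client.RecognitionConfig(
    language_code="en-US",
    max_alternatives=1,
    enable_automatic_punctuation=True,
)

# Offline (batch) recognition on a local WAV file.
with open("sample.wav", "rb") as f:
    audio_bytes = f.read()

response = asr.offline_recognize(audio_bytes, config)
for result in response.results:
    print(result.alternatives[0].transcript)
```

The Streamlit mic/upload apps mentioned above essentially wrap a client like this around a file uploader or microphone stream.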
reacted to ACloudCenter's post with 🔥, 4 months ago
I've really been into testing the various ASR, TTS, and other audio-related models. This space showcases the Nvidia Canary-Qwen 2.5B model. The model transcribes incredibly fast and combines Qwen for queries about the transcript.
All audio example files were generated with my adjacent VibeVoice Conference Generator Space. Another really cool model!!
ACloudCenter/canary-qwen-transcriber-2.5b
reacted to codelion's post with 🔥, 4 months ago
I recently worked on a LoRA that improves tool use in LLMs. Thought the approach might interest folks here.
The issue I have had when trying to use some of the local LLMs with coding agents is this:
Me: "Find all API endpoints with authentication in this codebase"
LLM: "You should look for @app.route decorators and check if they have auth middleware..."
But I often want it to actually search the files and show me, yet the LLM doesn't trigger a tool-use call.
To fine-tune it for tool use I combined two data sources:
1. Magpie scenarios - 5000+ diverse tasks (bug hunting, refactoring, security audits)
2. Real execution - Ran these on actual repos (FastAPI, Django, React) to get authentic tool responses
This ensures the model learns both breadth (many scenarios) and depth (real tool behavior).
Tools We Taught:
- read_file - Actually read file contents
- search_files - Regex/pattern search across codebases
- find_definition - Locate classes/functions
- analyze_imports - Dependency tracking
- list_directory - Explore structure
- run_tests - Execute test suites
Improvements:
- Tool calling accuracy: 12% → 80%
- Correct parameters: 8% → 87%
- Multi-step tasks: 3% → 78%
- End-to-end completion: 5% → 80%
- Tools per task: 0.2 → 3.8
The LoRA really improves intentional tool calling. As an example, consider the query: "Find ValueError in payment module"
The response proceeds as follows:
1. Calls search_files with pattern "ValueError"
2. Gets 4 matches across 3 files
3. Calls read_file on each match
4. Analyzes context
5. Reports: "Found 3 ValueError instances: payment/processor.py:47 for invalid amount, payment/validator.py:23 for unsupported currency..."
Resources:
- Colab notebook: https://colab.research.google.com/github/codelion/ellora/blob/main/Ellora_Recipe_3_Enhanced_Tool_Calling_and_Code_Understanding.ipynb
- Model: codelion/Llama-3.2-1B-Instruct-tool-calling-lora
- GitHub: https://github.com/codelion/ellora
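For anyone who wants to try the adapter, a minimal loading sketch with transformers + peft might look like the following. The base model ID and generation settings are assumptions, and the system prompt / tool schema the adapter was trained to emit (covered in the Colab notebook) is omitted here.

```python
# pip install transformers peft accelerate torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_id = "meta-llama/Llama-3.2-1B-Instruct"   # assumed base for this adapter
lora_id = "codelion/Llama-3.2-1B-Instruct-tool-calling-lora"

tokenizer = AutoTokenizer.from_pretrained(base_id)
base = AutoModelForCausalLM.from_pretrained(base_id, torch_dtype="auto", device_map="auto")
model = PeftModel.from_pretrained(base, lora_id)   # attach the tool-calling adapter

messages = [{"role": "user",
             "content": "Find all API endpoints with authentication in this codebase"}]
inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True,
                                       return_tensors="pt").to(model.device)
out = model.generate(inputs, max_new_tokens=256)
print(tokenizer.decode(out[0][inputs.shape[-1]:], skip_special_tokens=True))
```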
reacted to codelion's post with 🔥, 4 months ago
I wanted to share a technique that's been working really well for recovering performance after INT4 quantization.
Typically, quantizing an LLM to INT4 (unlike, say, INT8) for inference can incur some accuracy loss. Instead of accepting the quality loss, we used the FP16 model as a teacher to train a tiny LoRA adapter (rank=16) for the quantized model. The cool part: the model generates its own training data using the Magpie technique, so no external datasets are needed. This is critical because we want to stay as close as possible to the distribution of the model's natural responses.
Last year Apple's foundational models paper (https://arxiv.org/pdf/2407.21075) proposed a similar technique and found "By using accuracy-recovery LoRA adapters with only rank 16, Alpaca win rate can be improved by 7-18%, GSM8K accuracy is boosted by 5-10%." (page 47).
We saw similar results on Qwen3-0.6B:
Perplexity: 2.40 → 2.09 (only 5.7% degradation from FP16 baseline)
Memory: Only 0.28GB vs 1.0GB for FP16 (75% reduction)
Speed: 3.0x faster inference than FP16
Quality: Generates correct, optimized code solutions
- Pre-trained adapter: codelion/Qwen3-0.6B-accuracy-recovery-lora
- GitHub repo: https://github.com/codelion/ellora
Happy to answer questions about the implementation or help anyone trying to replicate this. The key insight is that quantization errors are systematic and learnable - a small adapter can bridge the gap without negating the benefits of quantization.
Has anyone else experimented with self-distillation for quantization recovery? Would love to hear about different approaches!
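A rough sketch of pairing a 4-bit base model with the recovery adapter using transformers, bitsandbytes, and peft is below; the NF4 settings shown are an assumption for illustration, not necessarily the exact quantization scheme used in the ellora repo.

```python
# pip install transformers peft bitsandbytes accelerate torch
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import PeftModel

base_id = "Qwen/Qwen3-0.6B"
adapter_id = "codelion/Qwen3-0.6B-accuracy-recovery-lora"

# Illustrative 4-bit config; see the repo for the exact setup the adapter was trained against.
bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(base_id)
base = AutoModelForCausalLM.from_pretrained(base_id, quantization_config=bnb, device_map="auto")
model = PeftModel.from_pretrained(base, adapter_id)  # rank-16 adapter recovers lost accuracy

prompt = "Write a function that reverses a linked list."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```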
replied to their post, 4 months ago
Thanks @treehugg3 for the thoughtful feedback! My main motive here was to break it down in simple words for people who are new to AI or just starting to learn about reasoning models. I completely understand your concern; I'll make sure to include more detailed explanations, examples, sources, and technical depth (like backtracking and novel reasoning paths) in upcoming blogs. Really appreciate your input, it helps me improve!
replied to their post, 4 months ago
Thanks @Edalexan. Glad you found it useful.
posted an update, 5 months ago
When you ask ChatGPT, Claude, or Gemini a really tough question, you might notice that little "thinking..." moment before it answers.
But what does it actually mean when an LLM is "thinking"?
Imagine a chess player pausing before their next move, not because they don't know how to play, but because they're running through possibilities, weighing options, and choosing the best one.
LLMs do something similar… except they're not really thinking like us.
Here's the surprising part:
You might think these reasoning skills come from futuristic architectures or alien neural networks.
In reality, most reasoning LLMs still use the same transformer decoder-only architecture as other models.
The real magic?
It's in how they're trained and what data they learn from.
Can AI actually think, or is it just insanely good at faking it?
I broke it down in a simple, 4-minute Medium read.
Bet you'll walk away with at least one "aha!" moment.
Read here - https://lnkd.in/edZ8Ceyg
posted an update, 5 months ago
Ever wondered how OpenAI's massive GPT-OSS-20B runs on just 16 GB of memory, or how GPT-OSS-120B runs on a single H100 GPU?
Seems impossible, right?
The secret is native MXFP4 quantization: a 4-bit floating-point format that's making AI models faster, lighter, and more deployable than ever.
What's MXFP4?
MXFP4, or Microscaling FP4, is a specialized 4-bit floating-point format (E2M1) standardized by the Open Compute Project under the MX (Microscaling) specification. It compresses groups of 32 values using a shared 8-bit scale (E8M0), dramatically lowering memory usage while preserving dynamic range, which is perfect for compact AI model deployment.
Think of it like this:
Instead of everyone ordering their own expensive meal (full-precision weights), a group shares a family meal (shared scaling). It's cheaper, lighter, and still gets the job done.
I've broken all of this down in my first Medium blog:
What's MXFP4? The 4-Bit Secret Powering OpenAI's GPT-OSS Models on Modest Hardware
Link - https://medium.com/@rakshitaralimatti2001/4-bit-alchemy-how-mxfp4-makes-massive-models-like-gpt-oss-feasible-for-everyone-573d6630b56c
HF - https://huggingface.co/blog/RakshitAralimatti/learn-ai-with-me
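To make the shared-scale idea concrete, here is a toy numpy sketch of microscaling: a block of 32 values shares one power-of-two scale, and each value is snapped to the nearest 4-bit E2M1 magnitude. It ignores the bit-packing and edge cases of the real MX spec, so treat it purely as an illustration.

```python
import numpy as np

# Representable magnitudes of FP4 E2M1 (sign handled separately).
E2M1 = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def mxfp4_roundtrip(block: np.ndarray) -> np.ndarray:
    """Quantize one block of 32 values with a shared power-of-two scale, then dequantize."""
    amax = np.max(np.abs(block))
    if amax == 0:
        return np.zeros_like(block)
    # Shared power-of-two scale chosen so the largest magnitude fits inside the E2M1 range.
    scale = 2.0 ** np.ceil(np.log2(amax / E2M1[-1]))
    scaled = block / scale
    # Snap each scaled magnitude to the nearest representable E2M1 value.
    idx = np.argmin(np.abs(np.abs(scaled)[:, None] - E2M1[None, :]), axis=1)
    return np.sign(scaled) * E2M1[idx] * scale

weights = np.random.default_rng(0).standard_normal(32).astype(np.float32)
recon = mxfp4_roundtrip(weights)
print("mean abs error:", np.mean(np.abs(weights - recon)))
# Memory math: 32 values * 4 bits + one 8-bit shared scale = 136 bits per block,
# i.e. about 4.25 bits/value versus 16 bits/value for FP16.
```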
posted an update, 5 months ago
Introducing Multi-Model RAG with LangChain!
Understand and query across images, tables, text, and files - all in one pipeline.
Get smart answers with relevant visuals or tables as references.
GitHub: https://github.com/rakshit2020/Multi-Model-RAG-LangChain
Demo video included - see it in action!
Built for developers & researchers.
Try it out, explore the code, and drop a star if you find it useful!
reacted to hexgrad's post with 🔥, 11 months ago
Wanted: Peak Data. I'm collecting audio data to train another TTS model:
+ AVM data: ChatGPT Advanced Voice Mode audio & text from source
+ Professional audio: Permissive (CC0, Apache, MIT, CC-BY)
This audio should *impress* most native speakers, not just barely pass their audio Turing tests. Professional-caliber means S or A-tier, not your average bloke off the street. Traditional TTS may not make the cut. Absolutely no low-fi microphone recordings like Common Voice.
The bar is much higher than last time, so there are no timelines yet and I expect it may take longer to collect such mythical data. Raising the bar means evicting quite a bit of old data, and voice/language availability may decrease. The theme is *quality* over quantity. I would rather have 1 hour of A/S-tier than 100 hours of mid data.
I have nothing to offer but the north star of a future Apache 2.0 TTS model, so prefer data that you *already have* and costs you *nothing extra* to send. Additionally, *all* the new data may be used to construct public, Apache 2.0 voicepacks, and if that arrangement doesn't work for you, no need to send any audio.
Last time I asked for horses; now I'm asking for unicorns. As of writing this post, I've currently got a few English & Chinese unicorns, but there is plenty of room in the stable. Find me over on Discord at
rzvzn: https://discord.gg/QuGxSWBfQy
reacted to bartowski's post with ❤️, over 1 year ago
So turns out I've been spreading a bit of misinformation when it comes to imatrix in llama.cpp
It starts true; imatrix runs the model against a corpus of text and tracks the activation of weights to determine which are most important
However what the quantization then does with that information is where I was wrong.
I think I made the accidental connection between imatrix and exllamav2's measuring, where ExLlamaV2 decides how many bits to assign to which weight depending on the goal BPW
Instead, what llama.cpp with imatrix does is it attempts to select a scale for a quantization block that most accurately returns the important weights to their original values, ie minimizing the dequantization error based on the importance of activations
The mildly surprising part is that it actually just does a relatively brute force search, it picks a bunch of scales and tries each and sees which one results in the minimum error for weights deemed important in the group
But yeah, turns out, the quantization scheme is always the same, it's just that the scaling has a bit more logic to it when you use imatrix
Huge shoutout to @compilade for helping me wrap my head around it - feel free to add/correct as well if I've messed something up
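A toy numpy sketch of that idea (not llama.cpp's actual code): for one block of weights, try a handful of candidate scales and keep the one that minimizes the importance-weighted round-trip error, where the importance values stand in for what imatrix derives from activations.

```python
import numpy as np

def best_scale_for_block(weights, importance, nbits=4, n_candidates=32):
    """Brute-force search for the block scale that minimizes importance-weighted quant error."""
    qmax = 2 ** (nbits - 1) - 1                      # symmetric int range, e.g. [-7, 7] for 4 bits
    base = np.max(np.abs(weights)) / qmax            # naive "fit the max value" scale
    best, best_err = base, np.inf
    for factor in np.linspace(0.7, 1.3, n_candidates):
        scale = base * factor
        q = np.clip(np.round(weights / scale), -qmax, qmax)
        dequant = q * scale
        # imatrix-style objective: errors on important weights count more
        err = np.sum(importance * (weights - dequant) ** 2)
        if err < best_err:
            best, best_err = scale, err
    return best, best_err

rng = np.random.default_rng(0)
w = rng.standard_normal(32).astype(np.float32)
imp = rng.random(32) ** 2                            # stand-in for activation-derived importance
scale, err = best_scale_for_block(w, imp)
print(f"chosen scale={scale:.4f}, weighted error={err:.6f}")
```

The quantized representation is the same either way; only the choice of scale changes when importance weighting is applied, which matches the description above.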