Tech Insights 2025 Week 24
by Johan Sanneblad
I listen to a lot of audio books, and always keep a book playing when out for a walk or a run. The joy of finding an amazing narrator like Bill Homewood reading The Count of Monte Cristo by Alexandre Dumas is one of many reasons I keep listening. There are, however, many books on Audible that are completely ruined by poor narration, and I have a list of over a dozen books where the reviews warned about the poor performance but I bought them anyway and just couldn’t stand listening to them. If you are an avid audio book listener you know exactly what I’m talking about.
That is why I am doubly excited about all the text-to-speech news this week, such as the truly amazing Eleven v3 model by ElevenLabs. If you have a minute and want to see exactly how far we have come when it comes to emotion in text-to-speech, go check out their launch video. It’s amazing, and it will change your mind on just how good text-to-speech AI has gotten in just a few months. Hopefully companies such as Audible will adopt this quickly, so that all audio books soon offer an “AI narrator” as an alternative. With narrators such as Bill Homewood I guess most people would still pick the human, but I’m quite sure I would pick the AI narrator in at least 30% of the books I have listened to so far, maybe more.
When it comes to coding with Claude Code I am more enthusiastic than ever about agentic coding, and I have just finished and submitted a new Obsidian plugin for review. This time it’s a complete rewrite of the entire Obsidian user interface, making it look and function exactly like Apple Notes (with two panes and keyboard navigation) but with a good hierarchical tag browser built in as well. Everything was 100% written, tested and documented with Claude Code, and I even did a major refactor to React half-way through. I have gone through every line by hand, and the quality of the code produced by Claude 4 Opus is nothing short of amazing. BUT you need to prompt it the right way, and never send it off to analyze and implement things in one shot. Make a plan, discuss the plan, ask it to think, keep notes of progress, and most importantly always ask it to review the code it generated at least once, preferably twice. Claude Code always finds things to improve in the generated code, even things you might have missed. My new plugin Notebook Navigator for Obsidian should hopefully be out in the Community Browser later this week.
Listen to Tech Insights on Spotify: Tech Insights 2025 Week 24 on Spotify
Thank you for being a Tech Insights subscriber!
THIS WEEK’S NEWS:
- ElevenLabs Releases v3 Text-to-Speech Model with Audio Tag Controls
- ElevenLabs Launches Conversational AI 2.0 with Advanced Turn-Taking and Multimodal Support
- Cursor 1.0 Released with Automated Code Review and Background Agents
- Google Previews Upgraded Gemini 2.5 Pro with Enhanced Coding Performance
- Apple Research Challenges AI Reasoning Claims in LLM Study
- Volvo Introduces AI-Powered Multi-Adaptive Seatbelt for EX60 Electric SUV
- Claude Projects Expands Knowledge Capacity 10x with Automatic RAG Technology
- ChatGPT Launches Connectors for Third-Party Apps and Transcription Features
- Reddit Sues Anthropic Over Unauthorized AI Training Data Scraping
- Hanabi AI Launches OpenAudio S1 AI Voice Actor with Real-Time Emotional Control
- Hume AI Launches EVI 3 Voice Model with Custom Voice Creation
- Perplexity AI Adds SEC Filing Search to Democratize Financial Data Access
- Mistral Launches Enterprise Coding Assistant Using Open Models
- NVIDIA Releases Llama Nemotron Nano VL for OCR and Advanced Document Processing
- Luma AI Launches Modify Video Tool Powered by Ray2 Model
ElevenLabs Releases v3 Text-to-Speech Model with Audio Tag Controls
https://elevenlabs.io/blog/eleven-v3
The News:
- ElevenLabs launched Eleven v3 (alpha), a text-to-speech model that lets users control voice emotion, tone, and sound effects through inline audio tags like [whispers], [angry], [laughs], and [door creaks].
- The model supports over 70 languages, expanding from 33 languages and increasing global population coverage from 60% to 90%.
- Dialogue Mode enables multi-speaker conversations with natural interruptions, tone shifts, and emotional flow between characters.
- Users can combine multiple tags for complex control, such as “[whispers] Something’s coming… [sighs] I can feel it” or switch accents mid-conversation.
- The model is available at 80% off until the end of June 2025 for self-serve users through the UI, though Professional Voice Clones are not yet optimized for v3.
My take: Stop what you are doing and go watch their launch video! It’s 4 minutes long but you’ll get the point from the first minute. This is it, really. We have solved text-to-speech. Not only reading texts, but doing it with emotion. This will change everything when it comes to presenting, talking, reporting and reading texts. There are still some giveaways that tell you the voices are AI-generated, but I’m quite sure those will be fixed soon.
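If you want to try the audio tags yourself, here is a minimal sketch of what a call against the ElevenLabs text-to-speech endpoint could look like. Treat the model id "eleven_v3" and the placeholder voice id as assumptions on my part; check the ElevenLabs documentation for the exact identifiers before running it.

```python
# Minimal sketch: expressive speech with ElevenLabs inline audio tags.
# Assumptions: the "eleven_v3" model id and the placeholder voice id are
# illustrative only -- verify both against the ElevenLabs docs.
import requests

API_KEY = "YOUR_ELEVENLABS_API_KEY"   # from your ElevenLabs account settings
VOICE_ID = "YOUR_VOICE_ID"            # any voice from your voice library

text = "[whispers] Something's coming... [sighs] I can feel it. [laughs] Or maybe not."

response = requests.post(
    f"https://api.elevenlabs.io/v1/text-to-speech/{VOICE_ID}",
    headers={"xi-api-key": API_KEY, "Content-Type": "application/json"},
    json={
        "text": text,              # inline tags control emotion and sound effects
        "model_id": "eleven_v3",   # assumed id for the v3 alpha model
    },
)
response.raise_for_status()

with open("expressive.mp3", "wb") as f:
    f.write(response.content)     # the endpoint returns raw audio bytes
```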
ElevenLabs Launches Conversational AI 2.0 with Advanced Turn-Taking and Multimodal Support
https://elevenlabs.io/blog/conversational-ai-2-0
The News:
- ElevenLabs released Conversational AI 2.0, an updated voice agent platform that enables businesses to create more natural-sounding AI assistants for customer service, healthcare, and sales applications.
- The platform features a turn-taking model that analyzes conversational cues in real-time, determining when to interrupt or wait during dialogue to reduce awkward pauses and interruptions.
- Agents can now process both voice and text inputs simultaneously, allowing users to switch between speaking and typing within the same conversation.
- Built-in Retrieval-Augmented Generation (RAG) enables agents to access external knowledge bases with low latency, such as medical assistants consulting treatment protocols from healthcare databases.
- Automatic language detection supports 40+ languages, allowing agents to identify and respond in different languages without manual setup.
- The platform includes batch calling functionality for automated outbound communications like surveys and alerts, plus enterprise features including HIPAA compliance and optional EU data residency.
My take: In addition to Eleven v3 (above), which lets users control voice emotion, tone and sound effects, Conversational AI 2.0 means voice assistants now know when to pause, speak and take turns. ElevenLabs launched Conversational AI 1.0 just six months ago in December 2024, which shows just how quickly this area is developing. If you are developing conversational AI for customer support, then this release, with features such as automatic language detection and the new turn-taking model, will make a big difference in your work.
Cursor 1.0 Released with Automated Code Review and Background Agents
https://www.cursor.com/changelog
The News:
- Cursor, the AI-powered code editor, released version 1.0 with automated code review capabilities that can save developers time on manual PR reviews and bug detection.
- BugBot automatically analyzes GitHub pull requests, identifies potential code errors, and leaves detailed comments with one-click fixes directly in the Cursor editor.
- Background Agent became available to all users, allowing developers to create AI agents that run in remote environments, clone repositories, work on separate branches, and push changes automatically.
- The update includes Jupyter Notebook support where agents can create and edit multiple cells directly, improving workflows for data science and research tasks.
- One-click MCP (Model Context Protocol) setup and a preview of the Memories feature that stores facts from previous AI conversations were also added.
My take: If you are using Cursor, then this is the update you have been waiting for. There are so many great features in here, like BugBot and Background Agent, but also excellent support for Jupyter Notebooks. I do however feel that Cursor is moving into uncharted territory with these new autonomous features. To use them you need to enable Max mode in Cursor, which means you pay per message sent to Claude. And if you want the best performance you need to use Claude 4 Opus, a model that typically costs hundreds of dollars per day if you do not have a Claude Max subscription. The problem, however, is that you cannot use the flat-rate Claude Max subscription in Cursor, only with Claude Code. So using Cursor for agentic coding and bug fixing will be EXPENSIVE.
If you want to move into autonomous coding (you should!), then you might want to follow the same route as me and thousands of other developers: ditch Cursor and move fully into autonomous coding with VS Code and Claude Code. It’s a thrilling new way to develop software, and I have never been more productive than with this setup. Cursor will never be able to offer flat-rate subscriptions for unlimited usage since they do not own the models that do the actual coding, which is why the future for Cursor looks quite uncertain, especially compared to fully agentic systems like Claude Code that provide unlimited usage for a fixed monthly fee.
Google Previews Upgraded Gemini 2.5 Pro with Enhanced Coding Performance
https://blog.google/products/gemini/gemini-2-5-pro-latest-preview
The News:
- Google released an upgraded preview of Gemini 2.5 Pro, its flagship AI model designed for coding and reasoning tasks, addressing developer feedback about previous performance issues.
- The model achieved a 24-point Elo score improvement on LMArena (reaching 1470) and a 35-point jump on WebDevArena (1443), maintaining top positions on both leaderboards.
- Gemini 2.5 Pro now scores 82.2% on the Aider Polyglot coding benchmark, surpassing all competition from OpenAI, Anthropic, and DeepSeek.
- Google added “thinking budgets” that allow developers to control computational costs and response latency for complex queries (see the sketch after this list).
- The model costs $1.25 per million input tokens and $10 per million output tokens for prompts up to 200,000 tokens, making it Google’s most expensive AI model.
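To make the “thinking budget” concrete, here is a minimal sketch of how you would cap it with the google-genai Python SDK. The model string is an assumption on my part, so substitute whatever identifier Google lists for the preview you have access to.

```python
# Minimal sketch: capping Gemini 2.5 Pro reasoning with a thinking budget.
# Assumption: the model string below is illustrative; use the identifier
# Google lists for the preview available to you.
from google import genai
from google.genai import types

client = genai.Client(api_key="YOUR_GEMINI_API_KEY")

response = client.models.generate_content(
    model="gemini-2.5-pro-preview-06-05",     # assumed preview identifier
    contents="Refactor this function to remove the N+1 query pattern: ...",
    config=types.GenerateContentConfig(
        thinking_config=types.ThinkingConfig(
            thinking_budget=1024,             # max tokens spent on internal reasoning
        ),
    ),
)
print(response.text)
```

A lower budget trades some answer quality on hard problems for lower cost and latency; a higher budget does the opposite.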
My take: So why is it a big deal that Google Gemini 2.5 Pro now exceeds all other models on one benchmark (Aider Polyglot) but is far below them on other benchmarks like SWE-bench agentic coding? Here’s why it matters: the Aider Polyglot benchmark shows how good the model is across different programming languages, such as C++, Go, Java, JavaScript, Python and Rust. If you use C++ or Java and use AI primarily for auto-complete, like GitHub Copilot or Cursor with Agent mode switched off, then the Polyglot benchmark is what you should be looking at. And this is where Gemini really shines!
On the other hand, if you are moving into agentic coding with tools like Claude Code, and primarily code in TypeScript or Python, then SWE-bench Verified is what you should be looking at. It measures how good the model is at autonomous coding: completing workflows by itself and reasoning deeply about source code. SWE-bench Verified better predicts performance on real-world autonomous coding tasks. I believe most organizations are better off looking at the Aider Polyglot benchmark, but the few of us who have taken the step into a fully autonomous way of working will only be looking at SWE-bench results.
Apple Research Challenges AI Reasoning Claims in LLM Study
The News:
- Apple researchers published a study questioning the mathematical reasoning capabilities of large language models from OpenAI, Google, and Meta, arguing these systems rely on pattern matching rather than genuine logical reasoning.
- The team created GSM-Symbolic, a new benchmark that modifies existing math problems with small changes like switching names or adding irrelevant information to test reasoning consistency (the sketch after this list shows the idea).
- Performance dropped across all tested models when problems included minor modifications, with smaller models showing the biggest decline and even GPT-4o dropping by 0.3% while OpenAI’s o1-preview dropped by 2.2%.
- Adding a single irrelevant clause to math problems caused accuracy to drop by up to 65%, with models like o1-mini and Llama3-8B incorrectly subtracting irrelevant details from their calculations.
- The study concluded that “current LLMs are not capable of genuine logical reasoning” and “instead, they attempt to replicate the reasoning steps observed in their training data”.
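To make the perturbation idea concrete, here is a tiny toy sketch in the spirit of GSM-Symbolic: the arithmetic stays identical while names and numbers vary, and an optional irrelevant clause is appended. This is my own illustration of the approach, not code from Apple’s benchmark.

```python
# Toy illustration of GSM-Symbolic-style perturbations: keep the arithmetic
# identical, vary names and numbers, and optionally append an irrelevant clause.
# My own sketch of the idea -- not code from Apple's benchmark.
import random

TEMPLATE = (
    "{name} picks {n} kiwis on Friday and twice as many on Saturday. "
    "{distractor}How many kiwis does {name} have in total?"
)
NAMES = ["Oliver", "Sofia", "Mei", "Amir"]
DISTRACTORS = [
    "",  # clean variant
    "Five of the kiwis picked on Saturday are a bit smaller than average. ",  # irrelevant clause
]

def make_variant(seed: int) -> tuple[str, int]:
    rng = random.Random(seed)
    name = rng.choice(NAMES)
    n = rng.randint(10, 50)
    question = TEMPLATE.format(name=name, n=n, distractor=rng.choice(DISTRACTORS))
    answer = n + 2 * n  # the distractor never changes the correct answer
    return question, answer

for seed in range(3):
    question, answer = make_variant(seed)
    print(question, "->", answer)
```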
My take: Apple’s track record within AI is not the best, and user feedback on this report has been quite mixed. Joakim Edin made a good summary of it, writing “while I find their findings interesting, they failed to consider alternative hypotheses. Instead of showing that LLMs can’t reason, they may have shown that their logical reasoning is sometimes flawed”. For those of us who work deeply with AI models, it is well known that models do not perform formal reasoning; they mimic it through probabilistic pattern-matching against the closest similar data seen in their vast training sets. When we say that models “think”, we refer to test-time compute in the inference phase, but recent models have been fine-tuned so well that many people actually believe they have a mind of their own. Gen Z especially is increasingly turning to ChatGPT for on-demand therapy, and good luck trying to convince them that ChatGPT does not think.
So what does this news mean for you as an AI user? It means you should be careful when prompting to get the best results. The models are not “smart”, and the less irrelevant information you send them, the higher the chance they will give you what you are looking for. It is quite similar to giving other people instructions: the more you pile on, the higher the risk they will mess things up.
Read more:
- Has Apple proven that large language models are incapable of logical reasoning?
- What do you all think of the latest Apple paper on current LLM capabilities? : r/MachineLearning
- People are increasingly turning to ChatGPT for affordable on-demand therapy, but licensed therapists say there are dangers many aren’t considering | Fortune
Volvo Introduces AI-Powered Multi-Adaptive Seatbelt for EX60 Electric SUV
https://www.autonews.com/volvo/ane-volvo-ai-assisted-seat-belt-ex60-debut-0605
The News:
- Volvo’s “multi-adaptive safety belt” debuts in the 2026 EX60 electric SUV, using sensors to customize crash protection based on passenger size, weight, body shape, seating position, and crash characteristics.
- The system expands load-limiting profiles from three to 11 settings, allowing more precise force adjustments during crashes compared to traditional seatbelts that apply uniform force regardless of occupant size.
- Larger occupants receive higher belt tension to reduce head injury risk in severe crashes, while smaller passengers get lower tension to prevent rib fractures in milder impacts.
- Interior and exterior sensors analyze crash direction, speed, and passenger posture “in the blink of an eye” to select optimal belt settings automatically.
- The system receives over-the-air software updates to improve performance using real-world crash data from Volvo’s database of 80,000 accident cases collected over five decades.
My take: A seatbelt that receives over-the-air software updates, how does that sound to you? 😂 First of all, calling this seatbelt “AI” is maybe stretching things. It’s a real-time rule-based system that uses sensor data to apply varying amounts of tension based on how the person is seated and the characteristics of the crash. I understand how in theory this could be a good thing, but I can also see so many different types of crashes that it becomes very easy to miss scenarios when developing this kind of system. Take for example the Hövding bike helmet, where the Swedish Consumer Agency stuck to its decision that Hövding 3 is a risk since it can miss certain types of accidents. If I got to choose between a dumb belt that works 100% of the time and a smart belt that updates itself over the air and tries to make smart decisions about how to apply tension, I would pick the dumb belt every time, at least until we have had a few years of usage to prove otherwise.
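For what it’s worth, here is a purely hypothetical toy sketch of what a rule-based profile selection could look like: sensor estimates of the occupant and the crash are mapped to one of eleven belt-force profiles. None of the rules or values below come from Volvo; they only illustrate the kind of logic involved.

```python
# Purely hypothetical sketch of rule-based load-limiter selection: map sensor
# estimates of occupant and crash to one of eleven belt-force profiles.
# Not Volvo's actual logic or values -- illustration only.
from dataclasses import dataclass

@dataclass
class CrashContext:
    occupant_mass_kg: float   # from seat sensors
    crash_severity: float     # 0.0 (mild) .. 1.0 (severe), from exterior sensors
    upright_posture: bool     # from interior sensing

NUM_PROFILES = 11             # profiles indexed 0 (lowest force) .. 10 (highest)

def select_profile(ctx: CrashContext) -> int:
    # Heavier occupants and more severe crashes push toward higher belt force;
    # a slouched posture nudges the force down to reduce rib-injury risk.
    mass_factor = min(ctx.occupant_mass_kg / 110.0, 1.0)
    score = 0.5 * mass_factor + 0.5 * ctx.crash_severity
    if not ctx.upright_posture:
        score -= 0.1
    score = min(max(score, 0.0), 1.0)
    return round(score * (NUM_PROFILES - 1))

print(select_profile(CrashContext(95.0, 0.8, True)))   # severe crash, heavy occupant
print(select_profile(CrashContext(55.0, 0.3, False)))  # mild crash, light occupant
```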
Claude Projects Expands Knowledge Capacity 10x with Automatic RAG Technology
https://support.anthropic.com/en/articles/11473015-retrieval-augmented-generation-rag-for-projects
The News:
- Claude Projects now automatically enables Retrieval Augmented Generation (RAG) when project knowledge approaches context window limits, expanding storage capacity by up to 10x while maintaining response quality.
- The system switches seamlessly between standard context processing and RAG mode without user setup, using a project knowledge search tool to retrieve relevant information from uploaded documents instead of loading all content into memory.
- RAG activates automatically when projects exceed context limits and can convert back to context-based processing when knowledge drops below the threshold.
- The feature works with all Claude tools including web search, extended thinking, and Research, maintaining response quality consistent with in-context processing.
- Available for all paid Claude.ai plans including Pro, Max, Team, and Enterprise.
My take: Previously, Claude Projects did not have a semantic index of the files in a project, making it subpar in performance to ChatGPT Projects, which always stores the text of all project files as vector embeddings. The benefit of Claude’s approach is that it will try to fit everything in the context if possible, whereas ChatGPT always uses a RAG approach to retrieve data. If you use Claude, this is a great improvement that will finally make Projects quite usable even with large amounts of documents or data.
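The general pattern is easy to picture in code: stuff everything into the context if it fits, otherwise fall back to retrieval. Here is a minimal sketch of that logic as I understand it; it is my own illustration, not Anthropic’s implementation, and the token budget and the retrieve_top_k helper are made up.

```python
# Minimal sketch of "full context if it fits, otherwise retrieve" for project
# knowledge. Illustration only -- not Anthropic's implementation; the budget
# and the retrieve_top_k helper are hypothetical.

CONTEXT_TOKEN_BUDGET = 150_000  # hypothetical share of the context window

def estimate_tokens(text: str) -> int:
    # Crude heuristic: roughly four characters per token for English prose.
    return len(text) // 4

def build_knowledge_block(documents: dict[str, str], query: str, retrieve_top_k) -> str:
    total = sum(estimate_tokens(doc) for doc in documents.values())
    if total <= CONTEXT_TOKEN_BUDGET:
        # Small project: include every document verbatim in the prompt.
        return "\n\n".join(f"# {name}\n{doc}" for name, doc in documents.items())
    # Large project: retrieve only the chunks most relevant to the question.
    chunks = retrieve_top_k(documents, query, k=20)
    return "\n\n".join(chunks)
```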
ChatGPT Launches Connectors for Third-Party Apps and Transcription Features
https://help.openai.com/en/articles/11487775-connectors-in-chatgpt
The News:
- OpenAI launched Connectors for ChatGPT, a beta feature that integrates third-party applications like Google Drive, GitHub, and SharePoint directly into conversations, allowing users to access their own data without switching between platforms.
- Chat search connectors enable quick file lookups with results appearing inline, such as asking “Show me Q2 goals in Drive” or “Find last week’s roadmap in Box”.
- Deep research connectors analyze complex queries across multiple sources simultaneously, producing fully cited reports that combine internal documents with web information.
- Synced connectors pre-index selected content from Google Drive to provide faster responses and improved answer quality without re-querying sources each time.
- The feature also includes meeting recording and transcription capabilities that generate timestamped notes and suggest action items, with recordings automatically deleted after transcription.
- Custom connectors using the Model Context Protocol (MCP) allow developers to connect proprietary systems and internal tools for Team, Enterprise, and Edu users.
My take: I feel few people have actually reported on what this feature really does. Instead of having to upload documents to ChatGPT for processing, you can now allow ChatGPT to automatically find and read specific documents directly from, say, your Dropbox or Box account. It does not index your files, and it does not keep a vector database of them. This means it has no idea what the contents of your files are, so you cannot ask ChatGPT to find documents by semantic meaning. ChatGPT does, however, have this for Google Drive, which they call a Synced Connector. I’m not too enthusiastic about all these new basic Connectors; I am instead hoping they will add Synced Connectors for both Dropbox and SharePoint in addition to Google Drive. The new meeting transcription feature is nice, however, and is a serious competitor to Microsoft Copilot 365. Many people I know only bought a Copilot license to get transcriptions in Teams meetings; now you can have it with your ChatGPT license. And in contrast to Microsoft Copilot 365, meeting participants do not see if you transcribe the Teams meeting with ChatGPT.
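The custom connectors mentioned above are built on the Model Context Protocol, so exposing an internal system means writing a small MCP server. Here is a minimal sketch using the FastMCP helper from the official Python SDK; the search tool and its fake index are made-up stand-ins for whatever internal system you would actually connect.

```python
# Minimal sketch of a custom MCP server a connector could point at.
# Uses the official "mcp" Python SDK; the tool and its fake index are
# made-up stand-ins for a real internal system.
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("internal-docs")

@mcp.tool()
def search_internal_docs(query: str, limit: int = 5) -> list[str]:
    """Return titles of internal documents matching the query."""
    fake_index = ["Q2 goals", "Roadmap 2025", "Onboarding guide"]  # replace with a real lookup
    return [title for title in fake_index if query.lower() in title.lower()][:limit]

if __name__ == "__main__":
    mcp.run()  # defaults to stdio transport; remote connectors typically need HTTP/SSE
```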
Reddit Sues Anthropic Over Unauthorized AI Training Data Scraping
The News:
- Reddit filed a lawsuit against AI startup Anthropic on June 4, 2025, alleging the company illegally scraped user content to train its Claude chatbot without permission or compensation.
- The lawsuit claims Anthropic accessed Reddit’s servers over 100,000 times since 2024, violating the platform’s terms of service and user agreement.
- Reddit alleges Anthropic trained Claude on content from high-quality subreddits including r/science, r/explainlikeimfive, r/AskHistorians, and r/programming, with the AI model retaining even deleted posts.
- The social media platform seeks an injunction to stop further data use, deletion of all Reddit-derived training material, financial restitution, and punitive damages.
- Reddit’s chief legal officer Ben Lee stated: “We will not tolerate profit-seeking entities like Anthropic commercially exploiting Reddit content for billions of dollars without any return for redditors or respect for their privacy”.
My take: Both Google and OpenAI have signed licensing agreements with Reddit that compensate the platform and include user privacy protections. If you somehow thought Anthropic were the “good guys”, especially with their strong focus on ethical AI, I guess we can conclude that most AI companies are the same when it comes to how they view information owned by other companies. If you didn’t already know it, Sam Altman is personally one of the largest investors in Reddit, holding a stake valued at well over $1 billion. This lawsuit is very similar to the one I reported on last week, with The New York Times signing an agreement with Amazon while at the same time suing OpenAI. It’s clear now that most content won’t be available to all AI companies, only to those with the strongest partnerships and the deepest pockets.
Hanabi AI Launches OpenAudio S1 AI Voice Actor with Real-Time Emotional Control
https://openaudio.com/blogs/s1
The News:
- Hanabi AI released OpenAudio S1, a text-to-speech model that generates emotionally expressive speech with real-time control over tone, pitch, and emotional nuance.
- The model supports over 50 emotions and tone markers, from basic commands like “(angry)” and “(sad)” to complex instructions such as “(confident but hiding fear)” or “(whispering with urgency)”.
- OpenAudio S1 achieves sub-100ms latency and ranks #1 on Hugging Face’s TTS Arena leaderboard, outperforming ElevenLabs 2.5, OpenAI, and Cartesia in expressiveness benchmarks.
- The system uses a 4-billion parameter architecture trained on 2 million hours of audio data and supports voice cloning with just 10-30 seconds of reference audio.
- The model handles 11 languages and can seamlessly switch between multiple speakers within a single audio clip.
- OpenAudio S1 is available in open beta on fish.audio for free trial, with paid plans starting at $15/month.
My take: Well, it wasn’t hard to outperform ElevenLabs 2.5, OpenAI or Cartesia in expressiveness benchmarks when those models had little expressiveness to begin with. Hanabi AI is a small “four-person Gen Z company”, and despite the OpenAudio name the models they develop are not open-source or open-weight. If you want to explore expressive text-to-speech I would strongly recommend the Eleven v3 model I wrote about above, but if you feel experimental you might want to give Hanabi AI a try.
Hume AI Launches EVI 3 Voice Model with Custom Voice Creation
https://www.hume.ai/blog/introducing-evi-3
The News:
- Hume AI released EVI 3, an Empathic Voice Interface model that creates custom AI voices through natural language descriptions, eliminating the need for technical configuration or complex attribute adjustments.
- Users describe desired voice characteristics verbally, such as “a high-pitched, laid-back voice with a sarcastic edge and a New York accent,” and the model generates the voice in under one second.
- The model offers over 30 voice styles including preprogrammed personalities like “Old Knocks Comedian,” “Seasoned Life Coach,” and “Wise Wizard”.
- EVI 3 achieves 300ms response latency on high-end hardware and outperformed GPT-4o in blind testing across empathy, expressiveness, and naturalness metrics.
- The system recognizes emotions in user speech by analyzing pitch, rhythm, and timbre, then adjusts its emotional tone throughout conversations.
- Currently available through Hume’s demo platform and iOS app, with API access planned for release within weeks.
My take: Hume AI was founded in 2021 by Alan Cowen, a former Google scientist, and raised $50M last year. In blind testing with 1,720 participants, EVI 3 outperformed OpenAI’s GPT-4o across seven conversational dimensions: amusement, audio quality, empathy, expressiveness, interruption handling, naturalness, and response speed. Conversational AI and text-to-speech have seen enormous growth the past few months, and we are rapidly approaching the point where systems will be good enough for most customer interactions in all organizations. You can try EVI 3 online at https://demo.hume.ai/. I thought it was quite good, and I am looking forward to their API release.
Read more:
- Hume AI
- Emotive voice AI startup Hume launches new EVI 3 model with rapid custom voice creation | VentureBeat
Perplexity AI Adds SEC Filing Search to Democratize Financial Data Access
https://www.perplexity.ai/hub/blog/answers-for-every-investor
The News:
- Perplexity AI launched SEC/EDGAR integration last week, allowing all users to query Securities and Exchange Commission filings through natural language questions. This feature makes complex financial documents accessible to retail investors, students, and professionals without requiring expensive subscriptions or specialized knowledge.
- Users can ask questions about company earnings, risks, strategies, and financial performance using plain English and receive answers directly sourced from official SEC documents like 10-K annual reports, 10-Q quarterly filings, and 8-K event reports.
- Every response includes direct citations to source documents, enabling users to verify information and explore deeper details from the original filings.
- The integration works across Perplexity’s Search, Research, and Labs features, allowing users to combine SEC data with market analysis, news coverage, and industry research in single conversations.
- Enterprise Pro customers gain additional access to FactSet’s M&A and transcript data alongside Crunchbase’s firmographic data for comprehensive comparative analysis.
My take: Traditional financial data platforms like Bloomberg Terminal and Capital IQ require expensive subscriptions costing thousands of dollars annually and complex interfaces designed for professional analysts. They also typically limit access through paywalls and technical complexity, leaving retail investors to rely on simplified summaries or fragmented information sources. This approach is completely different. Perplexity now offers free access to the same underlying SEC data that drives professional investment decisions, presented through conversational AI rather than complex dashboards. If you work as an investor or are even slightly interested in this, you might want to check it out right away. It’s even available for free users.
Mistral Launches Enterprise Coding Assistant Using Open Models
https://mistral.ai/news/mistral-code
The News:
- Mistral AI released Mistral Code, an enterprise-focused coding assistant that addresses security and customization concerns blocking many organizations from adopting mainstream AI coding tools.
- The platform bundles four specialized models: Codestral for code completion, Codestral Embed for code search and retrieval, Devstral for complex multi-step coding tasks, and Mistral Medium for chat assistance.
- Mistral Code supports over 80 programming languages and can reason across files, Git diffs, terminal output, and issue tracking systems.
- The service offers three deployment options: cloud, reserved capacity, or air-gapped on-premises hardware.
- Enterprises can fine-tune or post-train the underlying models on private repositories, a capability that “simply doesn’t exist in closed copilots tied to proprietary APIs”.
- The platform entered private beta for JetBrains IDEs and VSCode, with general availability planned soon.
My take: Mistral Codestral was released over a year ago in May 2024, and is generally not considered to be among the best code-completion AI models. Instead of improving it, Mistral now releases it as part of a “bundle”, with the key USP being that organizations can fine-tune or post-train it on their private repositories. Will that make it on par with Gemini 2.5 Pro or Claude 4 Opus for coding? No. The main reason for running Codestral at all is if you cannot run cloud-based models for any reason but still want some of the benefits of AI auto-completion. Then this Mistral Code bundle might be for you. Mind you, it seems aimed at really large companies like Abanca (a Spanish bank) and SNCF (France’s national railway); if you are a smaller company you might have problems even getting in contact with Mistral.
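For reference, the auto-complete part of the bundle is fill-in-the-middle completion with Codestral. Here is a minimal sketch using the mistralai Python SDK; the model name is an assumption on my part, so check Mistral’s documentation for the identifier available on your plan.

```python
# Minimal sketch: fill-in-the-middle completion with Codestral via the
# mistralai Python SDK. Assumption: the model name is illustrative -- check
# Mistral's docs for the identifier available on your plan.
import os
from mistralai import Mistral

client = Mistral(api_key=os.environ["MISTRAL_API_KEY"])

prompt = "def fibonacci(n: int) -> int:\n    "   # code before the cursor
suffix = "\n\nprint(fibonacci(10))"              # code after the cursor

response = client.fim.complete(
    model="codestral-latest",                    # assumed model identifier
    prompt=prompt,
    suffix=suffix,
)
print(response.choices[0].message.content)       # the completion for the gap
```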
NVIDIA Releases Llama Nemotron Nano VL for OCR and Advanced Document Processing
The News:
- NVIDIA launched Llama Nemotron Nano VL, an 8-billion parameter vision-language model that extracts information from documents, charts, tables, and diagrams while running on a single GPU.
- The model achieved first place on the OCRBench v2 benchmark, demonstrating superior performance in optical character recognition and document analysis tasks.
- Built on Llama 3.1 architecture with CRadioV2-H vision encoder, the model supports up to 16K context length and processes multiple images within document sequences.
- NVIDIA provides 4-bit quantization support through AWQ technology, enabling deployment on edge devices like Jetson Orin and laptops via the TinyChat framework.
- The model handles complex document types including scanned forms, financial reports, technical diagrams, and multi-page documents with text and table parsing capabilities.
- Available through NVIDIA’s NIM API and downloadable from Hugging Face, the model supports both server and edge inference scenarios.
My take: It’s very hard to evaluate just how good this model is. Sure, it’s in “first place” on OCRBench v2, but look at the list of models in that benchmark and you will mostly find older ones like GPT-4V from 2023 and Qwen2 from June 2024. This again shows how difficult it is to assess all the AI news pouring out every day. For everyone living in the EU it doesn’t really matter anyway: since the model is built on the Llama 3.1 architecture you cannot legally use it within the European Union due to Meta’s EU restrictions. If you live in the EU, just ignore this model and try something like Mistral OCR if you want to convert documents into markdown text.
Luma AI Launches Modify Video Tool Powered by Ray2 Model
https://lumalabs.ai/blog/news/introducing-modify-video
The News:
- Luma AI released Modify Video, an AI-powered tool that transforms existing video footage by changing styles, backgrounds, characters, or objects while preserving original motion and camera dynamics.
- The tool processes videos up to 10 seconds long and outputs at 720p or 1080p resolution, requiring around 400 credits for 5-second clips and 800 credits for 10-second videos.
- Users can upload any video footage and modify it through text prompts or visual references, with three transformation presets: Adhere (minimal changes), Flex (balanced creativity), and Reimagine (full scene reinterpretation).
- The system maintains facial expressions, lip sync, body language, and camera movements from the original footage while applying the requested changes.
- Modify Video is available with all paid Dream Machine subscriptions, with the highest plan costing $66.49 per month for unlimited HD generation.
My take: Luma conducted blind testing showing that Modify Video outperformed Runway’s video-to-video generation tool in motion preservation, facial animation, and temporal consistency. While AI video modification already exists from competitors like Runway and Pika, Luma claims superior fidelity in maintaining actor performances and creating organic-looking results rather than stitched-together elements. User feedback has so far been mixed, as is usual with these AI video generation engines (except for maybe Veo 3, which most people seem to like), but if you are working with video you might want to check it out. Just be aware that the results do look very cartoonish and artificial; Luma AI is way behind Google Veo 3 when it comes to realistic appearance.