Tech Insights 2025 Week 14

    Last week OpenAI updated their GPT-4o model with image creation capabilities. If you previously used ChatGPT to create images with their “DALL-E 3” model you know it was quite a poor performer, so maybe you skipped this news due to low expectations. But the image generation capabilities of GPT-4o are completely unlike anything you have seen before. First, like Gemini 2.0 Flash, GPT-4o now processes text and images through a unified system, not as separate tasks. This means that the model uses the same neural pathways for understanding both language and visual content, and can access its entire knowledge base and conversation context when creating images. It understands what you want it to create, and it understands how it should create it. Is it perfect? Absolutely not. You will still get artifacts and probably need to re-render each image a few times for best results. But is it good enough for most use cases? Absolutely! I am quite sure we will see an explosion in AI art unlike anything we have seen so far, much like how ubiquitous large language models have become for text and source code generation.

    Anthropic published results from an “AI Microscope” they developed to peek inside the inner workings of a large language model. The research revealed that Claude plans ahead when generating content, particularly when writing poetry, by first selecting appropriate rhyming words and then building lines to lead toward those targets. Anthropic says “The many examples in our paper only makes sense in a world where the models really are thinking in their own way about what they say”. Very interesting research, and if you have 3 minutes, go watch their video about it.

    Thank you for being a Tech Insights subscriber!

    THIS WEEK’S NEWS:

    1. OpenAI’s GPT-4o Image Generator Transforms AI Image Generation
    2. Anthropic’s AI Microscope Reveals How LLMs Think Like Human Brains
    3. Users Abandoning Cursor Due To Context Limitations
    4. Google Launches Gemini 2.5 Pro, Claims Top Spot on LMArena
    5. AI Model “ECgMLP” Achieves Near-Perfect Accuracy in Cancer Detection
    6. Alibaba Cloud Launches Three New AI Models in One Week
    7. Ideogram Launches Ideogram 3.0 with Enhanced Photorealism and Style Controls
    8. Reve Image 1.0 Tops Global AI Image Generation Rankings
    9. DeepSeek V3-0324: Powerful Open-Source AI Model That Runs on Consumer Hardware
    10. Tencent Launches Hunyuan-T1 Reasoning Model to Compete in China’s AI Race
    11. Kyutai Launches MoshiVis: First Real-Time Speech-to-Speech Vision Model
    12. Anthropic Launches “Think” Tool for Claude to Improve Complex Problem-Solving

    OpenAI’s GPT-4o Image Generator Transforms AI Image Generation

    https://openai.com/index/introducing-4o-image-generation

    The News:

    • OpenAI released a new image generation system integrated directly into GPT-4o, offering improved photorealism, text rendering, and the ability to refine images through natural conversation.
    • The system leverages GPT-4o’s knowledge base and conversation context when creating images, allowing it to transform uploaded images, maintain visual consistency across multiple edits, and support complex prompts with up to 20 different objects.
    • Unlike previous DALL-E models, GPT-4o’s image generator uses an autoregressive approach, creating images gradually from left to right and top to bottom instead of all at once, which improves accuracy and realism (see the toy sketch after this list).
    • The tool quickly went viral for its ability to create Studio Ghibli-style images, with users flooding social media with AI-generated pictures reimagining personal photos and famous scenes in this distinctive style.
    • Due to overwhelming demand, OpenAI has temporarily limited access to paid subscribers and imposed rate limits, with CEO Sam Altman announcing that “our GPUs are melting” and that free tier users will soon be limited to three image generations per day.
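
    On the autoregressive approach mentioned above: here is a toy sketch of the raster-order idea, generating an “image” one token at a time, left to right and top to bottom. The token palette and the predict_next() stand-in are invented purely for illustration; OpenAI has not published GPT-4o’s actual image decoder, so this only shows the general principle, not their implementation.

    ```python
    import random

    # Toy raster-order autoregressive generation. The "image tokens" and the
    # predict_next() stand-in are invented for illustration; this is not
    # OpenAI's actual decoder.
    PALETTE = [" ", ".", ":", "#"]   # pretend image tokens (darkness levels)
    HEIGHT, WIDTH = 4, 16

    def predict_next(context):
        """Stand-in for the model: pick the next image token conditioned on
        every previously generated token (random here, learned in a real model)."""
        return random.choice(PALETTE)

    tokens = []
    for _ in range(HEIGHT * WIDTH):
        tokens.append(predict_next(tokens))   # one token at a time, left to right

    # Reassemble the flat token stream into rows, top to bottom
    for row in range(HEIGHT):
        print("".join(tokens[row * WIDTH:(row + 1) * WIDTH]))
    ```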

    What you might have missed: The things you can do with this new feature are just outstanding. Here are a few examples:

    My take: Two weeks ago I wrote about the amazing Google Native Image Generation with Gemini 2.0 Flash, which was the first multimodal model that could both understand and create images. This week OpenAI took it to a whole different level. When GPT-3 was released in June 2020 it was a monumental improvement, going from 1.5 billion parameters (GPT-2) up to a staggering 175 billion parameters. But it wasn’t until GPT-4 was released in 2023, when it became “good enough” for most text writing, that usage started to grow for real. I feel that the image generator in GPT-4o is similar in importance to the launch of GPT-4 in 2023. It’s good enough for most use cases, and it will have a profound effect on how we create and use images going forward.

    Anthropic’s AI Microscope Reveals How LLMs Think Like Human Brains

    https://www.anthropic.com/research/tracing-thoughts-language-model

    The News:

    • Anthropic has developed a tool inspired by neuroscience to understand how large language models work, aiming to make their inner workings more transparent. This tool is likened to an “AI microscope” or “brain scanner” that helps identify patterns of activity and information flows within LLMs.
    • The research reveals that Claude plans ahead when generating content, particularly when writing poetry, by first selecting appropriate rhyming words and then building lines to lead toward those targets—challenging the notion that LLMs only process one token at a time.
    • Claude appears to use a language-independent internal representation (a “universal language of thought”) where concepts are processed in a shared conceptual space across different languages.
    • The tool identified that LLMs can create fictitious reasoning processes or “alignment faking,” where Claude will provide plausible-sounding but incorrect explanations, especially in math tasks with false clues.
    • The current version of the tool can only capture a fraction of the total computation performed by Claude and requires several hours of manual work to understand how it answers even simple prompts.

    My take: It is common knowledge that Large Language Models generate one token at a time, constantly feeding the input question plus the response generated so far back into themselves to produce the next word. A few years ago many people downplayed LLMs as dumb “next token predictors”, but this paper shows that LLMs do indeed plan quite far ahead even when replying with just one token at a time. “The many examples in our paper only makes sense in a world where the models really are thinking in their own way about what they say” (from the Anthropic YouTube video). It seems LLMs are inherently much “smarter” than most people initially thought. If you have 3 minutes I really recommend the video they posted.
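
    To make the token-by-token loop concrete, here is a minimal sketch of greedy decoding with a small open model (GPT-2 via Hugging Face transformers) as a stand-in; Claude’s weights are not public, so this only illustrates the mechanics described above, not Anthropic’s model. Each iteration feeds the entire text generated so far back into the model and appends exactly one new token.

    ```python
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    # Small open model as a stand-in; the loop, not the model, is the point here.
    tok = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2")

    text = "He saw a carrot and had to grab it, his hunger was like a starving"
    for _ in range(12):
        ids = tok(text, return_tensors="pt").input_ids   # full context so far
        with torch.no_grad():
            logits = model(ids).logits                   # scores for the next token
        next_id = int(logits[0, -1].argmax())            # greedy pick
        text += tok.decode(next_id)                      # append and repeat
    print(text)
    ```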

    Users Abandoning Cursor Due To Context Limitations

    https://forum.cursor.com/t/cursor-is-getting-worse-and-worse/66070

    The News:

    • Cursor, the AI-powered code editor that integrates with models like Claude 3.7, is facing user backlash over recent performance issues and pricing changes.
    • Users report that since version 0.46, the IDE has become increasingly sluggish, with many experiencing freezing, crashing, and AI that no longer follows instructions accurately.
    • Specific complaints include the AI failing to locate referenced files, creating incomplete content, executing unnecessary instructions, and requiring significantly more prompts to complete tasks that previously took only a few interactions.
    • Performance issues are particularly pronounced when working with large code files (over 500 lines), where editing can take several minutes or even fail entirely.

    My take: I have created thousands of lines of code with Cursor + Claude over the past two weeks while building a Python program that generates an MP4 video from all my newsletters, and for me Cursor works just as well as it did 6 months ago. I think the main difference between me and most people who post about their recent issues is that I know my code base inside and out, and the way I use Cursor is to tell it exactly what code I want it to write and where it should put it.

    I have been very clear in all my seminars that 2025 is not the year when everyone without coding skills will be able to start a programming career thanks to AI. The simple reason for this is that AI models cannot hold your entire code base in their context window, which means that you still need to know exactly what your code does and how it is structured. This also means that you need to steer Cursor to keep your code structured so it grows in a technically sound way.

    The reason people are getting these issues now is that Anthropic recently started to introduce heavy rate limits on their API, which means Cursor had to add a limit where only 250 lines at a time are sent to the language model. For users who were used to vibe coding and asked the LLM to make changes based on functional requirements instead of the code base structure, the results were catastrophic, since Claude no longer had the full context available for analysis.

    If we want to reach a future where non-programmers can use LLMs to develop advanced software applications, we need much larger context windows and much higher API limits (Claude, as an example, allows only up to 20k tokens per minute; if you send it a full 200k-token window you have to wait 10 minutes before sending the next request). Google Gemini 2.5 Pro, mentioned below, has a massive 1 million token context window and a rate limit of 2 million tokens per minute, 100 times more than Claude 3.7. This is what we need for people not used to programming to develop complex software applications, and I already see lots of non-programmers actively switching from Cursor + Claude to something like Roo Code + Gemini. Personally, I like programming and I like designing software architectures, so I will stick with Cursor + Claude for now since it works very well with how I work.
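
    As a back-of-the-envelope illustration of why such limits force tools to chunk your code, here is the arithmetic in Python. The 4-characters-per-token ratio and the 250-line chunk size are rough assumptions for the example, not Cursor’s or Anthropic’s actual internals.

    ```python
    # Back-of-the-envelope: how a tokens-per-minute cap forces chunking.
    # The chars-per-token ratio and the chunk size are rough assumptions,
    # not Cursor's or Anthropic's actual numbers.

    def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
        """Very rough token estimate (English code/prose averages ~4 chars/token)."""
        return int(len(text) / chars_per_token)

    def minutes_to_wait(tokens_sent: int, tokens_per_minute: int = 20_000) -> float:
        """How long a rate-limited API makes you wait after a big request."""
        return tokens_sent / tokens_per_minute

    def chunk_by_lines(source: str, max_lines: int = 250):
        """Split a large file into slices small enough to send per request."""
        lines = source.splitlines()
        for i in range(0, len(lines), max_lines):
            yield "\n".join(lines[i:i + max_lines])

    big_file = "x = 1\n" * 100_000             # pretend 100k-line source file
    print(estimate_tokens(big_file))           # rough token count for the whole file
    print(minutes_to_wait(200_000))            # -> 10.0 minutes for a full 200k-token window
    print(sum(1 for _ in chunk_by_lines(big_file)))  # number of 250-line requests needed
    ```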

    If you are using Cursor + Claude, here is a quick guidebook with 10 tips from someone who shipped over 17 products with it.

    Google Launches Gemini 2.5 Pro, Claims Top Spot on LMArena

    https://blog.google/technology/google-deepmind/gemini-model-thinking-updates-march-2025

    The News:

    • Google just launched Gemini 2.5 Pro, a new AI model designed for complex reasoning tasks and improved performance across various benchmarks.
    • The model features a 1 million token context window, with plans to expand to 2 million, enabling it to process huge datasets and code bases. As a comparison, Claude 3.7 Sonnet has a 200k context window, and ChatGPT Plus and ChatGPT Teams only have 32k.
    • The Gemini 2.5 Pro API also allows up to 2 million tokens per minute, which is 100 times more than Claude 3.7, which only allows 20 000 tokens per minute (if you send Claude 200 000 tokens you need to wait 10 minutes before sending another request).
    • Gemini 2.5 Pro achieved 18.8% accuracy on Humanity’s Last Exam and outperformed competitors on benchmarks like GPQA and AIME 2025, scoring 84% and 86.7% respectively.

    What you might have missed (1): Thanks to its huge context window, Gemini 2.5 Pro is able to create complete and complex applications all by itself from the ground up! Here are some examples:

    What you might have missed (2): Here’s how a user built a complete fighter-jet game using just one chat session with Gemini 2.5 Pro: “Vibe Jet is a game I vibe-coded using Gemini 2.5 Pro. Today, I’m open-sourcing everything.”

    My take: It’s hard to overstate just how big of a launch Gemini 2.5 Pro is. It seems just as good as Claude at writing code, it has a context window that’s 5 times larger, and you can send 100 times more tokens per minute to it over the API compared to Claude (up to 2 million tokens per minute). And if you compare it with ChatGPT and its measly 32k context window, you can clearly see why there is such a big difference between GitHub Copilot, Claude and Gemini when it comes to programming capacity.

    My recommendation going forward is still that you need to be pretty good at programming to get the maximum effect out of AI tools, and that you should use several tools together for maximum efficiency. For example, I use Cursor for most of my development; o1 Pro, Gemini 2.5 Pro or Claude 3.7 for refactoring; and local MCP servers for productivity improvements. If you are not familiar with software development and want to go “vibe coding”, stick to the model with the largest context window available, and right now that model is Gemini 2.5 Pro.
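
    If you want to try that context window yourself, here is a minimal sketch using the google-generativeai Python SDK. The model identifier and the code base dump file are assumptions for the example; preview model names change, so check Google’s docs for the current id.

    ```python
    import google.generativeai as genai

    genai.configure(api_key="YOUR_API_KEY")

    # Preview model id at the time of writing; verify the current name in Google's docs.
    model = genai.GenerativeModel("gemini-2.5-pro-exp-03-25")

    # Stuff an entire (large) code base into a single prompt.
    # "my_big_codebase_dump.txt" is a placeholder file for this example.
    with open("my_big_codebase_dump.txt") as f:
        codebase = f.read()

    print(model.count_tokens(codebase))  # see how much of the 1M window you are using

    response = model.generate_content(
        ["Here is my whole code base:\n" + codebase,
         "Refactor the rate-limiting logic and explain your changes."]
    )
    print(response.text)
    ```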

    AI Model “ECgMLP” Achieves Near-Perfect Accuracy in Cancer Detection

    https://www.cdu.edu.au/news/ai-diagnoses-major-cancer-near-perfect-accuracy

    The News:

    • Researchers from Charles Darwin University and international partners have developed ECgMLP, an AI model that detects endometrial cancer with 99.26% accuracy from histopathological images.
    • The model significantly outperforms current automated diagnostic methods, which typically achieve only 78.91% to 80.93% accuracy.
    • Beyond endometrial cancer, ECgMLP has demonstrated impressive results with other cancers: 98.57% accuracy for colorectal cancer, 98.20% for breast cancer, and 97.34% for oral cancer.
    • The system enhances image quality, identifies critical tissue areas, and analyzes samples more quickly than traditional biopsy-based diagnoses that can take days or weeks.

    My take: Endometrial cancer affects over 600,000 Americans today, and thanks to this research there is a much better chance that the cancer can be detected in its very early stages. The method is also computationally efficient, meaning it will be available to clinics with limited resources. I have no doubt that we will use AI for most clinical investigations within just a few years, and then in the long term also use AI for individualized treatment.

    Alibaba Cloud Launches Three New AI Models in One Week

    https://qwenlm.github.io/blog/qvq-max-preview

    The News:

    • Alibaba Cloud has released three new AI models: Qwen2.5-VL-32B, Qwen2.5-Omni-7B, and QVQ-Max, each designed for different AI applications ranging from visual processing to multimodal reasoning.
    • Qwen2.5-VL-32B, released under Apache 2.0 license, achieves top results in image processing and outperforms its larger 72B counterpart in several benchmarks, scoring 74.7 points in MathVista and 70.0 points in MMMU.
    • Qwen2.5-Omni-7B processes text, images, audio, and videos while generating text and speech in real-time, optimized for edge devices like smartphones and laptops with applications in assistive technologies, customer service, and smart cooking assistants.
    • QVQ-Max focuses on visual reasoning capabilities, allowing the model to not just recognize content in images but also analyze and reason with this information to solve complex problems like mathematical equations based on visual data.
    • All models are available through platforms like Hugging Face, GitHub, and Alibaba Cloud’s ModelScope.

    My take: I really like the Qwen models, and having them released under the Apache 2.0 license is just amazingly good. We are getting so many good models now that can run on a single GPU (A100/H100) with really good performance that if your company has not yet begun looking into things like NVIDIA NIM to integrate with custom models, now is the time to start.
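
    As a rough illustration of why a 32B model fits on a single A100/H100, here is the plain parameter-memory arithmetic. This is a lower bound only; real serving also needs headroom for the KV cache and activations.

    ```python
    # Rough lower-bound VRAM needed just to hold the weights of a 32B model.
    # Real serving needs extra headroom for KV cache and activations.
    PARAMS = 32e9

    def weight_gb(params: float, bytes_per_param: float) -> float:
        return params * bytes_per_param / 1e9

    print(f"bf16 : {weight_gb(PARAMS, 2.0):6.1f} GB")   # ~64 GB -> fits an 80 GB A100/H100
    print(f"int8 : {weight_gb(PARAMS, 1.0):6.1f} GB")   # ~32 GB
    print(f"int4 : {weight_gb(PARAMS, 0.5):6.1f} GB")   # ~16 GB -> fits a 24 GB consumer GPU
    ```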

    Ideogram Launches Ideogram 3.0 with Enhanced Photorealism and Style Controls

    https://about.ideogram.ai/3.0

    The News:

    • Ideogram released version 3.0 of its AI image generation system with significant improvements in photorealism, text rendering, and style consistency.
    • The update introduces a style reference system allowing users to upload up to three reference images to guide aesthetic output, plus access to 4.3 billion style presets with unique style codes.
    • The new version delivers improved image quality with more sophisticated spatial compositions, precise lighting and coloring, and detailed backgrounds.
    • Text generation capabilities remain central to Ideogram’s functionality, with enhanced ability to incorporate text elements into complex layouts and brand visualizations.
    • In testing, Ideogram 3.0 outperformed competitors including Google’s Imagen 3, Flux Pro 1.1, and Recraft V3 in human evaluations.

    My take: I was never a fan of Ideogram before version 3; the image quality was quite poor compared to something like FLUX, and the only thing I used it for was the decent text rendering. Version 3 looks much better, and I will probably switch between this one, FLUX and GPT-4o as my main image generators going forward. They are each trained on different source material, so if you, like me, have a prompt that you know gives you that special “vibe” (like I have with my “Tech Insights” banners), you know the importance of having multiple generators for different tasks.

    Reve Image 1.0 Tops Global AI Image Generation Rankings

    https://preview.reve.art

    The News:

    • Reve AI has launched Reve Image 1.0, a new text-to-image AI model that currently ranks #1 in image generation quality, outperforming established competitors like Midjourney v6.1, Google’s Imagen 3, and Recraft V3.
    • The model excels in three key areas: prompt adherence (accurately following detailed instructions), aesthetics (creating visually appealing images), and typography (rendering clear, error-free text within images).
    • Users can generate images through text prompts, modify them with simple language commands, and upload reference images to match specific styles.
    • Reve Image 1.0 is currently available for free preview at preview.reve.art, with no announcement yet regarding API access or future pricing plans.

    My take: Here’s another “best image generator of the week” to check out if you needed one more. Do any of you who read my newsletter check out these new image generators, or do you have a few proven favorites that you switch between based on the situation? That’s what I do. I would love to hear your feedback on this; in the meantime, if your current image generator fails on a specific task, try sending your prompt to Reve, it might just be what you are looking for.

    DeepSeek V3-0324: Powerful Open-Source AI Model That Runs on Consumer Hardware

    https://api-docs.deepseek.com/news/news250325

    The News:

    • DeepSeek quietly released DeepSeek V3-0324, a 685 billion parameter AI model that can run on high-end consumer hardware like Apple Mac Studio with M3 Ultra at speeds of 20 tokens per second.
    • The model uses a Mixture-of-Experts architecture that activates only 37 billion parameters per token, significantly reducing computational demands while maintaining powerful capabilities.
    • Benchmark improvements over the previous V3 model include MMLU-Pro (75.9 → 81.2), GPQA (59.1 → 68.4), AIME (39.6 → 59.4), and LiveCodeBench (39.2 → 49.2).
    • The model excels at code generation, capable of producing error-free code up to 700 lines long, with improved front-end web development capabilities and enhanced Chinese language processing.
    • The model is free for commercial use under the MIT License.

    My take: Seriously, just look at the figures above! And then consider that you can run this model for free on a single Mac Studio! Maybe it’s a stretch to call a Mac Studio M3 Ultra with 512GB RAM “consumer grade hardware” when it costs over $12 000, but in NVIDIA land that price won’t even get you an H100 with 80GB VRAM. The full version of DeepSeek V3-0324 is 671 billion parameters and requires 715GB of storage, while the 4.5-bit quantized version (Q4_K_XL) offers the best accuracy among all quantized versions and only takes 406GB.

    DeepSeek is the model I am most enthusiastic about right now. Being able to run this beast of a model FOR FREE, on a small Mac Studio, with performance up to 20 tokens per second is incredibly good! There are so many possibilities: just plug this one into one of your agentic workflows and it can do virtually anything you can imagine.
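
    If you don’t have a 512GB Mac Studio at hand, the hosted model is also reachable through DeepSeek’s OpenAI-compatible API. Here is a minimal sketch with the openai Python package; whether “deepseek-chat” currently points to the V3-0324 checkpoint is something to verify in their API docs.

    ```python
    from openai import OpenAI

    # DeepSeek exposes an OpenAI-compatible endpoint; "deepseek-chat" should map
    # to the latest V3 checkpoint, but verify that in their API docs.
    client = OpenAI(api_key="YOUR_DEEPSEEK_API_KEY", base_url="https://api.deepseek.com")

    response = client.chat.completions.create(
        model="deepseek-chat",
        messages=[
            {"role": "system", "content": "You are a careful coding assistant."},
            {"role": "user", "content": "Write a Python function that validates IPv4 addresses."},
        ],
    )
    print(response.choices[0].message.content)
    ```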

    Tencent Launches Hunyuan-T1 Reasoning Model to Compete in China’s AI Race

    https://llm.hunyuan.tencent.com/#/blog/hy-t1?lang=en

    The News:

    • Tencent has officially released Hunyuan-T1, an advanced reasoning model designed for problem-solving with capabilities comparable to other leading AI models.
    • The model achieves 87.2 on the MMLU-PRO benchmark, placing it second behind OpenAI’s o1 but ahead of GPT-4.5 (86.1) and DeepSeek-R1 (84).
    • Hunyuan-T1 is built on a Hybrid-Mamba-Transformer architecture, marking the first lossless application of hybrid Mamba in ultra-large inference models, enabling twice the decoding speed of comparable models.
    • The model excels in logical reasoning (93.1 score), mathematics (96.2 on MATH-500), and demonstrates strong performance in handling ultra-long texts with first-word generation within seconds and output speeds of 60-80 tokens per second.
    • Tencent has priced the model competitively at RMB 1 ($0.14) per million input tokens and RMB 4 ($0.55) per million output tokens, matching DeepSeek-R1’s daytime rates.

    My take: China has really increased the pace of its AI development in the past months. DeepSeek with their amazing models V3 and R1, Alibaba Cloud with Qwen, and now Tencent with Hunyuan-T1. It’s now two years since the main discussion about AI was whether we should pause it because it might be too risky, and I think those discussions made us in Europe take a more passive stance towards AI development and focus on regulation instead of innovation. Today two things are clear: (1) it’s impossible to pause AI development since nearly everyone on the planet will soon depend on it in everything they do, and (2) if we really try to pause it, Chinese companies will continue ahead at this super-speed and rule the entire world within 3-5 years.

    Kyutai Launches MoshiVis: First Real-Time Speech-to-Speech Vision Model

    https://kyutai.org/moshivis

    The News:

    • Kyutai has released MoshiVis, the first real-time speech-to-speech Vision Speech Model that enables natural conversations about images while maintaining Moshi’s low-latency capabilities.
    • MoshiVis adds only 7 milliseconds of latency per inference step on consumer devices like a Mac Mini with M4 Pro Chip, keeping total latency at 55 milliseconds – well below the 80-millisecond threshold for real-time interaction.
    • MoshiVis can provide detailed audio descriptions of visual scenes, making it valuable for visually impaired users who need real-time descriptions of their surroundings.
    • Kyutai has open-sourced the model weights, inference code for PyTorch, Apple’s MLX, and Rust, along with visual speech benchmarks to foster further research and development.

    My take: This model is actually quite a big deal. It’s now possible to describe images in real-time with minimal latency, which opens up quite a lot of practical applications for assistive technology that weren’t possible before. And the model’s efficiency on consumer hardware means we could soon see these capabilities in everyday applications such as mobile apps.

    Anthropic Launches “Think” Tool for Claude to Improve Complex Problem-Solving

    https://www.anthropic.com/engineering/claude-think-tool

    The News:

    • Anthropic released a “think” tool that creates a dedicated space for Claude to perform structured thinking during complex tasks, significantly improving its ability to follow policies, make consistent decisions, and handle multi-step problems.
    • The tool allows Claude to pause mid-task to process new information obtained from tool calls or user interactions, showing a 54% performance improvement in airline customer service scenarios and a 1.6% improvement in software engineering tests.
    • Unlike Claude’s “extended thinking” capability (which happens before response generation), the “think” tool occurs during response generation as Claude discovers new information, making it particularly effective for tool output analysis, policy-heavy environments, and sequential decision-making.

    My take: Most “thinking models” available today reason with themselves before they answer. This new “think” tool means that Claude begins to answer, then pauses for a while to think through what it said, and then continues generating tokens. You can compare this to humans that “think out loud” when solving difficult tasks.

    If you are using Claude through the web site or apps, the Think tool is already active to help Claude provide better answers, especially for complex questions that require multiple steps of reasoning. Developers using Claude through the API can activate the Think tool with a small amount of code. This allows Claude to pause during its response to think through new information or complex reasoning steps before continuing. The key benefit is that Claude becomes more reliable at following instructions, making consistent decisions, and solving problems that require multiple steps – all without you having to change how you interact with it.
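
    For API users, activating it really is just a matter of defining a tool. Here is a minimal sketch with the anthropic Python SDK, following the tool-definition pattern from Anthropic’s engineering post; the description text and the model id are placeholders I chose for the example, so check the post and the current model list for exact values.

    ```python
    import anthropic

    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

    # A "think" tool in the spirit of Anthropic's engineering post: it has no
    # side effects, it just gives Claude a scratchpad to reason in mid-response.
    think_tool = {
        "name": "think",
        "description": (
            "Use the tool to think about something. It will not obtain new "
            "information or change anything, it just appends the thought to the log."
        ),
        "input_schema": {
            "type": "object",
            "properties": {
                "thought": {"type": "string", "description": "A thought to think about."}
            },
            "required": ["thought"],
        },
    }

    response = client.messages.create(
        model="claude-3-7-sonnet-20250219",  # verify the current model id
        max_tokens=1024,
        tools=[think_tool],
        messages=[{"role": "user", "content": "A customer wants to change a non-refundable ticket. What are their options?"}],
    )
    print(response.content)
    ```

    When Claude calls the tool you simply return an empty tool result and let it continue; the benefit comes from the structured pause itself, not from anything your code does with the thought.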