
How MXT-1.5 Performs Against Leading AI Models

By Moments Lab Content Team
October 2, 2024

The Moments Lab Research team rigorously tested our multimodal AI model, MXT-1.5, using the VideoMME dataset as a benchmark, and we’re excited to share that it outperforms major models including GPT-4o, Google Gemini 1.5 Pro, and Nvidia VILA 1.5.

As a leading AI and video search company, we are often asked how our solution compares with popular AI models. By benchmarking our technology against the latest systems, we can better understand its strengths, identify areas for improvement, and shape our future developments. 

Over the past few months, our research team put MXT-1.5 through its paces on VideoMME, a respected benchmark in the video AI community. After thorough evaluation, the results are compelling.
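
For context, Video-MME pairs each video with multiple-choice questions and groups videos by duration, so accuracy can be reported per length category as well as overall. Below is a minimal sketch of what such an evaluation loop looks like; `model.answer`, the sample fields, and every name here are hypothetical placeholders, not the actual Moments Lab harness.

```python
from collections import defaultdict

def evaluate(model, samples):
    """Compute overall and per-duration accuracy on multiple-choice video QA."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for s in samples:
        # Each sample carries a video path, a question, answer options,
        # the ground-truth letter, and a duration bucket (short/medium/long).
        pred = model.answer(s["video"], s["question"], s["options"])
        total[s["duration"]] += 1
        if pred == s["answer"]:
            correct[s["duration"]] += 1
    scores = {d: correct[d] / total[d] for d in total}
    scores["overall"] = sum(correct.values()) / sum(total.values())
    return scores
```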

Powerful AI for Complex Video Understanding

MXT-1.5 was designed to solve complex video understanding problems. It helps content creators quickly find specific moments in vast audiovisual libraries, enabling them to create more content faster.

What makes MXT-1.5 stand out is its unique approach: instead of relying on a single system, it combines multiple expert models, most of which are non-generative. It uses a three-level hierarchical indexing framework (sketched in code after the list) that works as follows:

  1. Shot Understanding – analyzing individual video segments.
  2. Shot Grouping into Sequences – organizing segments into coherent chapters.
  3. Overall Video Summarization – producing concise, high-level summaries of entire video files.
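
To make that structure concrete, here is a minimal sketch of a three-level indexing pipeline. MXT-1.5's internals are not public, so the expert models are passed in as plain callables, and every name here is illustrative rather than the actual implementation.

```python
from dataclasses import dataclass

@dataclass
class Shot:
    start: float            # seconds
    end: float
    description: str = ""   # level 1: per-shot understanding

@dataclass
class Sequence:
    shots: list[Shot]
    chapter: str = ""       # level 2: coherent chapter label

def index_video(video_path: str, detect_shots, describe_shot, group_shots,
                label_sequence, summarize) -> dict:
    # Level 1: segment the video and analyze each shot independently.
    shots = [Shot(s, e) for s, e in detect_shots(video_path)]
    for shot in shots:
        shot.description = describe_shot(video_path, shot.start, shot.end)

    # Level 2: group related shots into sequences and label each chapter.
    sequences = [Sequence(g) for g in group_shots(shots)]
    for seq in sequences:
        seq.chapter = label_sequence([s.description for s in seq.shots])

    # Level 3: produce a concise summary of the whole file from the chapters.
    summary = summarize([seq.chapter for seq in sequences])
    return {"shots": shots, "sequences": sequences, "summary": summary}
```

Keeping each level's output explicit is part of what makes results traceable: a file-level summary can be followed back down to the chapters and individual shots it was built from.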

This structure, while different from what other AI providers use, means MXT-1.5 can analyze even the most complex video content with exceptional accuracy.

Leading in Long Video Processing and Specialized Content

The benchmark results were significant. MXT-1.5 performed well overall, and even outperformed major models such as GPT-4o, Google Gemini 1.5 Pro, and Nvidia VILA 1.5. It did especially well at processing long-form videos (30 minutes or more), a known challenge for AI systems.

“This evaluation confirms that combining generative models with expert AI systems creates a more robust technology. Our approach not only delivers more detailed results, but also improves explainability.”

Dr. Yannis Tevissen, Head of Science, Moments Lab.

Figure: VideoMME benchmark showing MXT-1.5’s score on long video understanding compared with leading AI models.

These results confirm our belief that top-tier video understanding requires a combination of specialized models, particularly non-generative ones. MXT-1.5’s three-level hierarchical indexing is a key advantage, enabling it to outperform leading GenAI models, especially on the critical task of processing long-form videos. The approach proves particularly effective in categories such as sports and television, reflecting both our industry-specific AI training and the impact of recent improvements such as sequence generation, and solidifying our lead in these areas.

What’s Next for Moments Lab?

We’re naturally thrilled with MXT-1.5’s performance, but this is just the beginning. Our team is already diving deeper into the results and will run similar evaluations of our semantic search engine against other open-source search models.
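
As one example of what such a search evaluation could look like, here is a hedged sketch of a recall@k comparison. The `search` callable, query set, and relevance labels are all placeholders, and recall@k is simply one common retrieval metric, not necessarily the methodology our team will use.

```python
def recall_at_k(search, queries, relevant, k=10):
    """Fraction of queries whose relevant item appears in the top-k results.

    `search(query)` is any engine returning a ranked list of document ids;
    `relevant` maps each query to its ground-truth document id. Both are
    placeholders for whichever engines end up being compared.
    """
    hits = sum(relevant[q] in search(q)[:k] for q in queries)
    return hits / len(queries)
```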

The work our team has put into MXT-1.5’s capabilities serves as a firm foundation for what’s to come. We’re working on a groundbreaking new tool that will help content producers build rough cuts even faster.

Dive into more details about our MXT-1.5 benchmarking here.
