Linguistics Breakthrough: How a New MSU and Yandex Project Is Changing Neural Network Training

Russian Scientists and Yandex Unveil Unique Dataset to Train AI on Complex Russian Language Rules

For the first time, Russian researchers from Moscow State University and Yandex have joined forces to create an open dataset covering the most challenging aspects of Russian grammar and punctuation. This project addresses a long-standing issue: while modern language models have achieved impressive results in text generation, they often make mistakes where nuanced linguistic knowledge is required. The root of the problem lies in the lack of specialized examples in available datasets, which makes it difficult to train neural networks to handle complex rules correctly.

As part of this new initiative, the team compiled a dataset covering 48 key rules that traditionally cause difficulty even for native speakers. It includes examples of the kind found in exams and competitions, from punctuation in complex subordinate clauses to the spelling of words with the particle “не” and subject-verb agreement. Linguistics students took part in collecting and annotating the data, drawing on authoritative Russian language references. The result is a thousand carefully selected examples, each of which both corrects the errors and specifies the rules that apply.

The developers note that this approach does more than identify and correct mistakes: it also explains why each correction is right. This is especially important for training artificial intelligence, which needs to understand the logic of the language rather than fix text mechanically.

An Innovative Neural Network Training Method

To get the most out of the new dataset, the team implemented an original method based on Retrieval-Augmented Generation (RAG). When the model encounters an error, it first searches the dataset for similar cases and then uses the retrieved examples to fix the original sentence. This mechanism helps avoid unnecessary changes and keeps the corrections focused on the actual problem areas in the text.
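The retrieve-then-correct idea described above can be illustrated with a minimal sketch. This is not the project's actual code: the three annotated examples, the field names, the character-trigram similarity, and the prompt wording are all assumptions chosen to keep the example self-contained, and a real system would more likely use a learned retriever.

```python
# Hypothetical miniature of an annotated dataset: each entry pairs an
# erroneous sentence with its correction and the rule that explains it.
EXAMPLES = [
    {"source": "Я нехочу идти домой.",
     "corrected": "Я не хочу идти домой.",
     "rule": "The particle 'не' is written separately from verbs."},
    {"source": "Он сказал что придёт завтра.",
     "corrected": "Он сказал, что придёт завтра.",
     "rule": "A comma separates a subordinate clause introduced by 'что'."},
    {"source": "Дети играет во дворе.",
     "corrected": "Дети играют во дворе.",
     "rule": "The verb agrees in number with a plural subject."},
]


def char_trigrams(text):
    """Return the set of lowercase character trigrams in a sentence."""
    s = text.lower()
    return {s[i:i + 3] for i in range(len(s) - 2)}


def similarity(a, b):
    """Jaccard overlap of character trigrams (a stand-in for a real retriever)."""
    ta, tb = char_trigrams(a), char_trigrams(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def retrieve(query, dataset, k=2):
    """Return the k dataset entries whose erroneous sentence is closest to the query."""
    ranked = sorted(dataset, key=lambda ex: similarity(query, ex["source"]), reverse=True)
    return ranked[:k]


def build_prompt(query, examples):
    """Assemble a correction prompt that shows the retrieved cases before the query."""
    parts = ["Correct the sentence. Change only what the rules require.", ""]
    for ex in examples:
        parts.append(f"Rule: {ex['rule']}")
        parts.append(f"Before: {ex['source']}")
        parts.append(f"After: {ex['corrected']}")
        parts.append("")
    parts.append(f"Before: {query}")
    parts.append("After:")
    return "\n".join(parts)


query = "Я нехочу спать."
prompt = build_prompt(query, retrieve(query, EXAMPLES, k=1))
```

The resulting prompt would then be passed to a language model. Showing the model only the closest annotated cases is what nudges it toward minimal, rule-driven edits instead of rewriting the whole sentence.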

The GECToR model was chosen as the foundation for training and was modified to better suit the specifics of the Russian language. Testing on various language models, including YandexGPT 5 Lite, YandexGPT 5 Pro, and foreign counterparts, demonstrated a significant improvement in correction accuracy. According to F0.5, the metric commonly used internationally to assess grammatical error correction, which weights precision more heavily than recall, accuracy increased by 5 to 10 percent. The improvement was especially noticeable on the most challenging mistakes, which standard algorithms previously overlooked.
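For readers unfamiliar with the metric, F0.5 is the F-beta score with beta set to 0.5, which counts precision as more important than recall. The sketch below shows the standard formula; the function name and the edit counts in the usage lines are illustrative assumptions, not figures from the project.

```python
def f_beta(true_pos, false_pos, false_neg, beta=0.5):
    """F-beta score computed from counts of correct, spurious, and missed edits.

    With beta < 1 the score rewards precision (few spurious edits) more than
    recall (few missed errors).
    """
    b2 = beta * beta
    denom = (1 + b2) * true_pos + b2 * false_neg + false_pos
    if denom == 0:
        return 0.0
    return (1 + b2) * true_pos / denom


# Illustrative counts only: 80 correct edits, 10 spurious edits, 30 missed errors.
f05 = f_beta(80, 10, 30)           # precision-weighted score
f1 = f_beta(80, 10, 30, beta=1.0)  # balanced score for comparison
```

Because precision exceeds recall in this illustration, the F0.5 value comes out higher than F1. That is the intended behavior for correction systems, where an unnecessary change is usually considered worse than a missed error.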

YandexGPT 5 Pro, for example, improved its error correction accuracy to 83 percent after the new method was introduced, while the lighter version, YandexGPT 5 Lite, reached 71 percent. These results demonstrate the versatility and effectiveness of the proposed approach, which can be adapted for other languages and tasks.

Impact on AI Development

Experts note that such a tool opens up new possibilities for automatic proofreading and text correction systems in Russian. Neural networks can now handle complex syntactic structures as well as ordinary typos, a crucial advancement for educational platforms, automatic translation services, and voice assistants.

The project received international recognition at the ACL 2025 Conference on Computational Linguistics, where it was named one of the best solutions for the use of artificial intelligence in education. Leading global companies such as Google, Apple, IBM, and Bloomberg AI participated in the event, highlighting the high caliber of this Russian development.

At the Young Scientists Congress, held at the Sirius Science and Technology University from November 26 to 28, Yandex representatives presented their results in detail and shared plans for further development of the technology. Open access to the dataset and training methodology is expected to enable other researchers and companies to leverage these advancements in their own projects.

Speaking of Yandex: Russia’s Leader in Digital Technology

Incidentally, Yandex is one of Russia’s largest IT companies, founded in 1997. Over the years, the brand has evolved from a search engine into a multi-product ecosystem spanning everything from online maps and taxi services to cloud platforms and artificial intelligence. The company invests heavily in scientific research, collaborates with leading universities, and supports high-tech startups. In recent years, Yandex has focused especially on developing its own language models and AI-based services, enabling it to compete with global industry leaders. Millions of users rely on Yandex’s products both in Russia and abroad. Thanks to constant innovation and a strong focus on quality, Yandex remains at the forefront of the country’s digital transformation.

Fernando Molina 27.11.2025 16:09
