
Training is the plague: why training neural networks on real data can be dangerous

Last week it became known that the Russian government is going to vet artificial intelligence models trained on state data for threats to state security and national defense. What is interesting about this news is the very premise of training neural networks on state data. The question of what information such models work with is not an idle one: using real documents, personal data, and information sensitive to a company or the state for this purpose can lead to serious consequences. Izvestia looked into how artificial intelligence training is regulated today and why it needs to be closely monitored.
Will AI be trained on state data?
From 2025 to 2026, the Russian government plans to conduct research and develop principles for analyzing AI models trained on state data; a program that analyzes such models will then be put into operation. Five systems are expected to receive confirmation of the "acceptability of safe use" by 2030. A total of 8.1 billion rubles will be allocated for these purposes through 2030, and the FSB is responsible for implementing the project.
The ministry did not specify what kind of state data is involved, saying only that it would "support the development of AI, including within the framework of the new national project".
Before that, Deputy Prime Minister Dmitry Grigorenko instructed the Ministry of Digital Development, together with the Big Data Association (BDA), to work out a procedure for giving businesses access to government data, including passport details, citizens' employment records, and phone numbers. According to the BDA, there is no talk yet of training neural networks on this data, but that may be the next step.
Experts have already expressed concerns about possible leaks of state data when training neural networks.
At the same time, just the other day President Vladimir Putin signed two federal laws tightening liability for data leaks. Fines for companies can now reach 15 million rubles; in addition, turnover-based fines are possible, calculated from the company's total annual revenue for the previous year. Article 272.1, covering the illegal storage and distribution of personal data, has also been added to the Criminal Code.
Why it's dangerous to train AI on real data
Roman Dushkin, chief architect of artificial intelligence systems at the AI Research Center for Transport and Logistics at MEPhI National Research Nuclear University, notes that training on real documents can be a bad idea because those documents end up, one way or another, loaded into the AI system.
- "Many studies show that the data inside the neural network is stored, and when using certain techniques - be it prompting or generative-adversarial attacks - they can be retrieved," he told Izvestia.
Dmitry Fedotov, architect of the AiLine platform at Softline Digital (Softline Group), explains that some details can be inferred from indirect signs through skillfully constructed queries to a generative AI, and sometimes this happens entirely by accident.
Anton Nemkin, a member of the State Duma Committee on Information Policy, Information Technology and Communications, notes that minimizing these risks requires strict security measures and ethical standards: data must be anonymized and encrypted before the model is trained.
- "It is also important to implement mechanisms that prevent the 'memorization' of confidential information and to develop defenses against privacy attacks," the MP said.
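A minimal sketch of what that anonymization step might look like in practice is below; the masking patterns are illustrative assumptions, not a production-grade PII detector.

```python
# Minimal sketch of pre-training anonymization: PII is masked before a
# document reaches the training pipeline. The regex patterns are
# illustrative (rough Russian passport, phone, and e-mail formats),
# not an exhaustive or production-grade detector.
import re

PATTERNS = {
    "PASSPORT": re.compile(r"\b\d{4}\s?\d{6}\b"),  # e.g. 4509 123456
    "PHONE": re.compile(r"\+7[\s-]?\(?\d{3}\)?[\s-]?\d{3}[\s-]?\d{2}[\s-]?\d{2}"),
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
}

def anonymize(text: str) -> str:
    """Replace every matched PII span with a typed placeholder token."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text

record = "Client +7 (912) 345-67-89, passport 4509 123456, mail ivan@example.ru"
print(anonymize(record))
# -> Client [PHONE], passport [PASSPORT], mail [EMAIL]
```

Replacing values with typed placeholders, rather than simply deleting them, keeps the document structure intact for the model while stripping the identifying content.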
Sergey Galeev, head of the backend department at the IT company SimbirSoft, pointed out that both large players and small startups try to use the most relevant and voluminous datasets for AI training: open data from the Internet; the company's own databases, from purchase histories to user behavior on the site; and closed data obtained under contracts with partners or customers.
- "However, companies usually conceal exactly what information they use to train commercial AI systems, citing trade secrets and competitive risks," he told Izvestia.
However, Tatiana Zobnina, head of the data analysis and machine learning department at Naumen, is sure that no serious company will violate license agreements, personal data legislation, or its contract with the customer, since any of this would entail reputational and financial risks.
How can AI be trained to work with documents?
Accusations of training neural networks on personal data are periodically leveled at various companies. The major startup Dbrain, for instance, recently faced such accusations: it was claimed that the company not only trains its programs on real passports of citizens handed over by microfinance organizations, but also uses this information in its automated document verification services, passing other people's passports to crowdsourcers, that is, real people who are paid pennies to verify them. Dbrain denied the accusations.
- "We would like to emphasize that this information is inaccurate and does not correspond to reality. We strictly adhere to all personal data processing standards, including the requirements of Federal Law No. 152-FZ, and regularly undergo compliance audits by both our customers and independent organizations. This is one of the highest-priority components of our work," Alexey Khakhunov, founder of Dbrain, told Izvestia.
Meanwhile, says Roman Dushkin, crowdsourcing is indeed actively used by many companies on the market for data labeling ahead of AI training. In recent years dedicated platforms have emerged, and crowdsourcing has grown into a large industry and a sub-branch of the field.
- "Who ends up with the data from these platforms is not known at all. Therefore, the information security service at an enterprise must carefully monitor how models are trained and where data can be transferred. In large Russian companies, such as Rosatom, this is watched very closely," Roman Dushkin emphasized.
Dmitry Fedotov notes that industrial companies have their own information security requirements and standards that a contractor's product must meet, and compliance is always checked by the information security service.
Alexey Khakhunov added that when training artificial intelligence models, his company either uses synthetic documents, that is, fully generated ones, or works inside the customer's "loop", with the product fully integrated into the customer's infrastructure so that data never leaves it. For example, Dbrain's anti-fraud system, designed to recognize forgeries among documents, is trained on synthetic data with generated artificial documents that mimic real ones, as well as directly on forged documents sent in by customers, the company's founder said.
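A minimal sketch of what generating such fully synthetic records could look like, using the open-source Faker library (the field set is an illustrative assumption, not Dbrain's actual schema):

```python
# Minimal sketch of the fully synthetic approach: generate artificial
# "passport" records instead of using real citizens' data. Uses the
# open-source Faker library; the field set is illustrative and is not
# any company's actual schema.
from faker import Faker

fake = Faker("ru_RU")  # Russian locale yields plausible names and addresses

def synthetic_passport() -> dict:
    """Return one fully generated record; no real person is represented."""
    return {
        "series": fake.numerify("## ##"),
        "number": fake.numerify("######"),
        "full_name": fake.name(),
        "birth_date": fake.date_of_birth(minimum_age=18).isoformat(),
        "issued_by": fake.company(),
        "address": fake.address(),
    }

# A labeled training set of any size can be produced on demand.
dataset = [synthetic_passport() for _ in range(1000)]
print(dataset[0])
```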
Vladimir Arlazarov, CEO of Smart Engines and Doctor of Technical Sciences, said that the company uses specially created mock-ups of fake documents on real media: the forgeries are photographed, and virtual objects are generated from the images to train the neural network.
- "This solution shows excellent results and is fully within the law. The sole task of any anti-fraud system is to recognize generated data that has been substituted for real data, so the synthesis of information should reflect the logic of the fraud itself. In this case, it is necessary to train the AI on synthetic, as it were 'fake', samples," said Vladimir Arlazarov.
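One plausible reading of such a pipeline, as an assumption on our part rather than Smart Engines' published method, is ordinary image augmentation: a photographed mock document is turned into many virtual training samples by varying geometry and lighting. A minimal sketch with the standard Pillow library:

```python
# Minimal sketch (an assumed interpretation, not Smart Engines' actual
# pipeline): derive many virtual training samples from one photograph of
# a staged fake document by varying geometry and lighting. Uses the
# standard Pillow library; file names are hypothetical.
import os
import random

from PIL import Image, ImageEnhance

template = Image.open("mock_passport_photo.jpg")  # photo of a staged fake
os.makedirs("train", exist_ok=True)

def virtual_sample(img: Image.Image) -> Image.Image:
    """One synthetic variant: random rotation, brightness and contrast."""
    out = img.rotate(random.uniform(-8, 8), expand=True)
    out = ImageEnhance.Brightness(out).enhance(random.uniform(0.7, 1.3))
    out = ImageEnhance.Contrast(out).enhance(random.uniform(0.8, 1.2))
    return out

# Generate a batch of virtual objects for neural network training.
for i in range(100):
    virtual_sample(template).save(f"train/virtual_{i:03d}.jpg")
```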
How AI training is regulated now
Katerina Tikhomirova, Ph.D., professor of the Department of Philosophy, Ontology and Theory of Cognition and leading expert of the Laboratory of Digital Technologies in the Humanities at MEPhI, believes, however, that synthesized data is suitable for training a model only in the early stages of development; after that, she believes, real data will have to be used.
According to her, there is no shortage of data for training; the problem is that labeled texts are scarce, and access to new information is constrained by ethical and legal restrictions.
- "If the field is not regulated at the state level, data-leak scandals will dog this work. The first law on personal data protection has already been passed. We need another one, or amendments to the first, prohibiting the transfer of information by companies developing AI models," says Katerina Tikhomirova.
At the same time, she is sure that the creation of a "Russian sovereign AI" is possible only if the state provides access to data and pays for the work of data labelers.
Maria Gordenko, academic head of the master's program "Data Analysis in Development" at the Faculty of Computer Science of the National Research University Higher School of Economics, notes that for now the country has only recommendations and legislative initiatives on training neural networks. In particular, a Code of Ethics adopted in 2021 stresses that AI actors must comply with Russian legislation and use high-quality, representative datasets obtained without breaking the law; so far, compliance with it is voluntary. But state standards in the field of AI are already being developed, says Roman Dushkin, and in his opinion such standards should become mandatory for critical areas.
Anton Nemkin notes that Russia is now studying the experience of the EU and China in regulating AI technologies. The new rules should be reflected in the Digital Code that the Ministry of Digital Development is drafting. According to the deputy, the regulation should set minimum standards for the use of data in training neural networks, ensuring data security and users' rights to control their information. Verification, however, will require technical solutions and international cooperation, he noted. Nemkin also advocated developing certification mechanisms for companies that build neural networks and creating independent regulatory bodies and structures.
However, Vladimir Arlazarov notes that even the existing laws "On Personal Data", "On Trade Secrets" and "On the Basics of Public Health Protection" are not always observed by companies, so there is little point in talking about additional regulation today.
- "Besides, any special law on the rules of AI training will inevitably become obsolete a few years after its adoption, because technology does not stand still," said the Izvestia interlocutor. - "To reduce information security risks, it is important not to adopt new laws but to observe and modernize the existing ones."
At the same time, Alexey Khakhunov from Dbrain is sure that the market "is striving for maximum transparency".