
"Attempts to destabilize society coming from abroad remain a serious problem"

Roskomnadzor uses several automated systems to monitor the Internet and identify information prohibited by law. These systems have made it possible to significantly increase the speed and accuracy of the search for illegal content, from drug sales and calls for suicide to extremist materials and child pornography. Vadim Subbotin, deputy head of Roskomnadzor, told Izvestia in an interview how these systems work.
— It is obvious that the volume of destructive content grows in proportion to the ever-increasing amount of information users post online each year. When did it become clear that manual monitoring alone could no longer cope with "prohibition" on such a scale?
— Back in 2012, we were talking about only three types of the most dangerous information: child pornography, drug propaganda, and calls for suicide. But very little time has passed, and the list has expanded significantly. It now includes information aimed at drawing children into committing crimes, information on methods of making explosives, propaganda of sex reassignment, and so on. Simply put, there are now so many information threats that the eyes of on-duty monitoring staff and reports from vigilant citizens alone are no longer enough. Moreover, some prohibited information, such as calls for terrorism, must be blocked very quickly.
The agency's response time to illegal content, from the moment it appears, now ranges from a few minutes to six hours. Context also matters a great deal: some materials require experts to take a deep dive to evaluate them. This can take time, but, as practice shows, the correctness of a decision to block information depends on such expert assessment. And in the end, I emphasize, the decision is always made by experts: our staff psychologists, linguists, and art historians.
— What kind of automated systems are in the department's arsenal?
— To monitor online media, we use the Automated Mass Media Monitoring System. The Automated Television and Radio Broadcasting Monitoring System monitors materials on television and radio channels. The Clean Internet system, which consists of various modules that analyze text and transcribe audio, is responsible for finding illegal content on websites and social networks. These systems use various neural network technologies, and some developments are already showing results in automated image and video analysis (for example, the Oculus module).
— According to Brand Analytics, users currently post 3 billion messages per month on social media alone. With such volumes of content, how do your systems determine what to pick up first?
— In essence, they mimic user behavior. First, the Internet space is "vacuumed" by crawlers, or search robots, which are in a sense our "rank-and-file assistants"; the agency's experts compose the search queries for them. The information found then goes through several stages of verification.
At the first stage, duplicate and deleted content is filtered out, leaving only accessible, unique material. Then comes a stage that can involve various modules of our analytical systems, individually or in combination. One of them is linguistic dictionary analysis, which uses regular expressions to search for textual matches. Another is the unified analysis module (EMA), which uses neural network models to search for semantic signs of violations in text.
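To make those two text stages concrete, here is a minimal sketch in Python of what a first-stage deduplication filter plus a regex-based dictionary pass could look like. The patterns, the hashing choice, and the structure are illustrative assumptions, not a description of Roskomnadzor's actual code.

```python
import hashlib
import re

# Hypothetical first-stage filter: deduplication plus a regex "linguistic
# dictionary" pass. All patterns and names below are invented placeholders.
DICTIONARY = [
    re.compile(r"\bexample-market\b", re.IGNORECASE),   # marketplace name (invented)
    re.compile(r"\bdelivery\s+24/7\b", re.IGNORECASE),  # related keyword (invented)
]

seen_hashes: set[str] = set()

def first_stage(text: str) -> list[str]:
    """Drop duplicates, then return the dictionary markers found in the text."""
    digest = hashlib.sha256(text.strip().lower().encode()).hexdigest()
    if digest in seen_hashes:
        return []  # duplicate: filtered out before any deeper analysis
    seen_hashes.add(digest)
    return [p.pattern for p in DICTIONARY if p.search(text)]

posts = [
    "Example-Market: delivery 24/7",
    "example-market: delivery 24/7",   # near-duplicate, dropped by the hash check
    "an ordinary post about weather",  # no markers, passed along with no hits
]
for post in posts:
    print(first_stage(post))
```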
Let me explain with an example. Say the system downloads a post from a social network containing text and an image that advertises a drug marketplace. The linguistic dictionary will find certain markers, such as the marketplace's name and related keywords. The EMA will confirm the contextual meaning, and Oculus will identify the marketplace's logo and the link to it in the image.
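A rough sketch of how verdicts from several such modules might be fused for a single post; the module names follow the interview, but the scores, thresholds, and routing logic are invented for illustration.

```python
# Illustrative fusion of three module outputs for one post. Module names
# follow the interview; scores, thresholds, and routing are assumptions.
def combine_verdicts(dictionary_hits: int, ema_score: float, oculus_score: float) -> str:
    """Route a post using dictionary markers, text semantics, and image analysis."""
    strong_text = dictionary_hits > 0 and ema_score >= 0.8   # markers plus confirmed context
    strong_image = oculus_score >= 0.9                       # e.g., a marketplace logo found
    if strong_text and strong_image:
        return "flag: likely violation, send to operator"
    if strong_text or strong_image:
        return "review: one strong signal, lower priority"
    return "pass: no signs of violation"

# The drug-marketplace example from the interview: all three modules agree.
print(combine_verdicts(dictionary_hits=2, ema_score=0.93, oculus_score=0.95))
```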
— What is the accuracy of identifying prohibited information?
— Thanks to the modules working in concert, we have achieved high detection accuracy for most types of prohibited information — up to 98% for the types that are especially dangerous to citizens. When the system first launched, this figure was only 10%.
The key role of AI in monitoring is to analyze information and reduce the burden on operators by screening out materials that show no signs of violations. On average, the automated system downloads about half a million relevant materials per day. After successive analysis by the system and processing by an operator, about 2,000 materials violating Russian law remain (roughly 0.4% of what was collected). This automated "sieve" lets operators focus on more complex tasks requiring in-depth expert review. As a result, work efficiency increases significantly and information processing costs drop substantially.
— But we can't yet say that "AI is always right"?
— We can't. Yes, false positives do occur in the system's operation due to the ambiguity of language, the polysemy of words, and contextual nuances, which our models can, in rare cases, misinterpret. However, the share of false positives is extremely low, which we achieve through an integrated approach to training: diverse datasets, regular validation, and post-processing of results.
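The "regular validation" mentioned here is commonly done by scoring a model against a labeled hold-out set and measuring precision. A toy sketch, with invented post IDs, scores, labels, and an assumed confidence threshold:

```python
# Toy validation pass: measure false positives against a labeled hold-out set.
# Post IDs, scores, labels, and the confidence threshold are all invented.
predictions = [("post1", 0.97), ("post2", 0.55), ("post3", 0.91), ("post4", 0.20)]
labels = {"post1": True, "post2": False, "post3": True, "post4": False}
THRESHOLD = 0.8  # post-processing: only confident hits reach an operator

flagged = [pid for pid, score in predictions if score >= THRESHOLD]
false_positives = [pid for pid in flagged if not labels[pid]]
precision = 1 - len(false_positives) / len(flagged)
print(f"flagged: {len(flagged)}, false positives: {len(false_positives)}, "
      f"precision: {precision:.2f}")
```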
As you can imagine, we have created a unique set of systems and algorithms, each of which integrates specialized linguistic dictionaries and adaptive search algorithms. Developing such systems is delicate and complex engineering work that requires understanding the specifics of the content and fine-tuning a multidimensional architecture. First, the huge volume of data requires scalable, distributed storage, which complicates the system architecture. Second, training requires high-quality, sufficiently large datasets; over our years of working with prohibited information, we have accumulated enough of them. Third, training models on such data takes considerable time and computing resources, especially since the models must be constantly updated because the data keeps changing.
It was extremely important to integrate machine learning, which can independently identify new patterns and contexts, providing more flexible and accurate text processing. Many linguistic phenomena, such as sarcasm, irony, hidden calls to action, or emotional overtones, cannot be understood without taking the surrounding text and situation into account. Contextual machine learning models take word order and the relationships between words into account, which improves the understanding of meaning and reduces errors in interpreting complex statements.
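A toy contrast showing why sequence matters: a context-free unigram check flags a negated sentence, while even a simple bigram rule does not. The trigger word is a placeholder, and real contextual models (transformers) capture far richer relationships than bigrams.

```python
# Toy contrast: a unigram check ignores context; a bigram check catches
# simple negation. The trigger word is a placeholder for illustration.
TRIGGER = "harm"

def unigram_flag(text: str) -> bool:
    """Context-free check: fires whenever the trigger word appears at all."""
    return TRIGGER in text.lower().split()

def bigram_flag(text: str) -> bool:
    """Sequence-aware check: a preceding 'no' clears the flag."""
    words = text.lower().split()
    bigrams = set(zip(words, words[1:]))
    return TRIGGER in words and ("no", TRIGGER) not in bigrams

for s in ["They intend harm", "They intend no harm"]:
    print(f"{s!r}: unigram={unigram_flag(s)}, bigram={bigram_flag(s)}")
```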
Another important aspect is the explainability, or interpretability, of model results. It is important to understand why a model made a particular decision and which features were the most significant. Our system implements interpretation methods that give operators transparent, understandable explanations of how the models behave on complex categories of prohibited information, which increases trust and makes errors easier to identify. Incidentally, we currently work with about 30 types of prohibited information.
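The interview does not say which interpretation methods are used; one common model-agnostic option is occlusion, where the text is re-scored with each token removed and the drop in score is treated as that token's importance. A toy sketch with an invented scoring function:

```python
# Toy occlusion-based explanation: which tokens most influence the score?
# score() is a stand-in for a real classifier; its word weights are invented.
def score(tokens: list[str]) -> float:
    weights = {"buy": 0.4, "now": 0.1, "example-market": 0.5}
    return sum(weights.get(t, 0.0) for t in tokens)

def explain(tokens: list[str]) -> list[tuple[str, float]]:
    """Importance of a token = drop in score when that token is removed."""
    base = score(tokens)
    drops = [(tok, round(base - score(tokens[:i] + tokens[i + 1:]), 3))
             for i, tok in enumerate(tokens)]
    return sorted(drops, key=lambda pair: -pair[1])

print(explain(["buy", "now", "at", "example-market"]))
# -> [('example-market', 0.5), ('buy', 0.4), ('now', 0.1), ('at', 0.0)]
```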
— What new threats and challenges in the spread of prohibited information on the Internet do you consider most pressing right now, and how does Roskomnadzor plan to adapt its monitoring systems to counter them?
— Attempts to destabilize society that come from abroad and exploit a separatist agenda remain a serious problem.
We now see planted stories on this topic, propaganda of national and religious exclusivity, and the pitting of one group of people against another on those same grounds.
Here we additionally use the Vepr information system. It implements an innovative approach to identifying so-called points of information tension. The system intelligently evaluates changes in the information agenda across the country and in individual regions. By analyzing large volumes of online publications, Vepr identifies acute information flows that could turn into information threats. This happens on a near-real-time scale, which makes it possible to take prompt measures to neutralize threats. When I say "near real time", I mean the system produces data within a few minutes.
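Vepr's internals are not public. As a generic illustration of detecting a "point of information tension" as a volume anomaly, here is a standard rolling z-score detector over per-minute publication counts; the window size and threshold are arbitrary assumptions.

```python
import statistics

# Generic "information tension" detector: flag minutes where publication
# volume on a topic spikes far above the recent baseline. Window size and
# threshold are arbitrary; Vepr's actual methods are not public.
def tension_points(volumes: list[int], window: int = 10, z_threshold: float = 3.0) -> list[int]:
    """Return indices whose volume is a z-score outlier vs. the trailing window."""
    alerts = []
    for i in range(window, len(volumes)):
        baseline = volumes[i - window:i]
        mean = statistics.mean(baseline)
        stdev = statistics.pstdev(baseline) or 1.0  # guard against zero variance
        if (volumes[i] - mean) / stdev > z_threshold:
            alerts.append(i)
    return alerts

# A dozen quiet minutes, then a sudden burst of publications on one topic.
series = [100, 98, 103, 101, 99, 102, 97, 100, 104, 101, 99, 100, 650]
print(tension_points(series))  # -> [12]
```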
"Boar" is in continuous development. In the future, it will automatically detect signs of destructive information campaigns: situations where one or another information occasion is artificially planted and promoted to have a destabilizing effect on certain categories of citizens, children and society as a whole. As a result, there is an increase in the speed of response to such threats.
Another threat is attempts to destroy traditional Russian spiritual and moral values. For example, everything connected with the value of life can be gradually eroded by systematic destructive informational influence: the "routinization" of the horror of death, radical cyberbullying with threats of physical violence, the implanting of delayed suicidal attitudes, and the propaganda of destructive cults. Likewise, constant information pressure, through the justification of collaborationism, the dissemination of fabricated documents discrediting the biographies of national heroes, or the revision of our historical heritage (when the Soviet period is presented as an "era of repression" while achievements in science, culture, defense, and the social sphere are ignored), can gradually erode the value of patriotism. When certain quantitative thresholds are reached, such as the volume of destructive information, its frequency, the duration of exposure, and its reach, the groundwork is laid for the full-scale destruction of the traditional Russian value system or its elements.
And here we must work, among other things, on big data analysis systems that can determine when a potential threat escalates into a real one, requiring government intervention to protect the lives and health of Russian citizens and the national security of Russia.
In the near future, artificial intelligence and neural network technologies will make it possible, among other things, to accurately capture the qualitative characteristics of destructive information: the semantic content of the narratives being implanted, the degree of their "destructiveness" for the human psyche, the depth of the negative emotions they provoke, and the presence of psychological manipulation techniques.
All of these technologies rest on advances in AI and the latest developments in natural language processing, machine emotion recognition, and semantic analysis. We must not only keep our finger on the pulse but also act ahead of the curve, including by anticipating the emergence of possible threats.