
Deception with half a word: how hackers hack neural networks using hints

Cybercriminals can bypass the protections of neural networks by using indirect hints, experts warn. A dangerous technique called Echo Chamber makes it possible to imperceptibly nudge artificial intelligence (AI) into generating prohibited or malicious content despite built-in restrictions and filters. Read the Izvestia article to learn how hint-based hacking of neural networks works, how dangerous the mechanism is, and how to protect against it.
What is known about hacking neural networks using hints
Specialists at NeuralTrust have reported a dangerous new technique for bypassing neural network defenses, called Echo Chamber. According to the experts, the method makes it possible to imperceptibly steer large language models (LLMs), such as ChatGPT and Google's counterparts, into generating prohibited or malicious content despite built-in restrictions and filters. Analysts note that Echo Chamber stands out for its use of indirect hints, steered context and multi-stage logical guidance.
"Echo Chamber is a hidden multistep indirect prompt injection technique, where an attacker does not give the model direct commands, but gradually pushes it to an undesirable conclusion through a chain of logical hints," Stepan Kulchitsky, a leading specialist in the ML & Data Science department at Positive Technologies, says in an interview with Izvestia.
According to the expert, the first key feature of the Echo Chamber technique is that it draws the model into a harmless dialogue, for example a discussion of recipes. Then, at each step, subtle semantic hints are added, disguised as a continuation of the topic. The important point is that the hints are outwardly neutral; the model itself "slides" toward the malicious scenario, creating a chain of "echoes" of the key intent. As a result, the neural network produces instructions on prohibited topics without a single direct request.
What distinguishes the technique of hacking neural networks using hints?
Various methods of jailbreaking neural networks (circumventing restrictions on unsafe requests) that rely on creating a context in which a particular taboo topic becomes acceptable have existed for a long time, Vladislav Tushkanov, head of the machine learning technology research and development group at Kaspersky Lab, tells Izvestia.
"The simplest and most widely known example is the use of the past tense," says the specialist. "Although LLMs do refuse to answer potentially dangerous questions, they can provide the information as historical background if the request is phrased in the past tense."
In addition, there are structurally similar, well-known approaches in which the chatbot is carefully led toward accepting a malicious response over several rounds of dialogue. These are the so-called multi-step jailbreaks, one example of which is the Crescendo method discovered and described by Microsoft.
Older workarounds relied on changes of form: they swapped letters (k1ll instead of kill), inserted special characters, or asked the model to "encode the answer in Base64" or "play the role of an evil hacker." Such obfuscation is easy to stop with regular expressions and stop-word lists, adds Nikita Novikov, a cybersecurity expert at Angara Security.
"Unlike previous techniques, the Echo Chamber technique attacks meaning," the specialist explains. — At each step, the text is legal, there are no toxic tokens in it, but the whole sequence gently pushes the model to a forbidden result.
The smarter and more "talkative" the LLM, the higher the risk: it trusts its own long chain of reasoning more than it trusts its security policy. It is therefore necessary to block not individual symbols but the logic of the entire dialogue, Nikita Novikov emphasizes.
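To make the contrast concrete, below is a minimal, purely illustrative sketch in Python of the kind of form-based filter the older workarounds ran into; the stop-word list, the leet-substitution map and the function name are invented for the example. It catches obfuscated tokens such as k1ll, but the outwardly neutral turns of an Echo Chamber dialogue would pass through it untouched.

```python
import re

# Hypothetical form-based filter: normalize "leet" substitutions and strip
# special characters, then match against a stop-word list.
STOP_WORDS = {"kill", "bomb", "explosive"}
LEET = str.maketrans({"1": "i", "0": "o", "3": "e", "4": "a", "@": "a", "$": "s"})

def looks_malicious(prompt: str) -> bool:
    normalized = prompt.lower().translate(LEET)
    normalized = re.sub(r"[^a-z\s]", "", normalized)  # drop inserted special characters
    return any(word in normalized for word in STOP_WORDS)

print(looks_malicious("how to k1ll someone"))               # True: the surface trick is caught
print(looks_malicious("let's continue that recipe topic"))  # False: semantically steered text passes
```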
What is the danger of hacking neural networks using hints?
Hacking neural networks with the Echo Chamber method opens up broad opportunities for cybercriminals, including Russian ones, to generate malicious content, spread disinformation and carry out targeted attacks, Marina Probets, an internet analyst and expert at Gazinformservice, tells Izvestia. It makes it possible to create convincing fake news, generate instructions for building explosive devices or manufacturing drugs, and bypass the moderation systems of social networks and other online platforms.
"The danger lies in the potential increase in the scale of disinformation, the growth of cybercrime, as well as the difficulty of detecting and preventing such attacks," the expert notes. — In order to effectively combat them, new protection methods are needed that go beyond traditional security measures.
Echo Chamber effectively turns an ordinary chatbot into a free generator of harmful content, says Nikita Novikov. A couple of hints is enough, and the bot writes a phishing email, a macro virus or step-by-step instructions for making explosives. Only innocent-looking questions remain in the service logs, so the account is not blocked.
According to the expert, Telegram channels have already appeared that sell ready-made Echo Chamber hint chains for cryptocurrency. They can be hooked up to a cloud ChatGPT subscription and generate hundreds of responses per minute. This dramatically lowers the barrier to entry on cybercrime forums: there is no need to train your own model, just buy a script.
"In addition to direct harm (explosions, malware), the method is suitable for the quiet dissemination of disinformation, blackmail and social engineering inside corporate chats,— says Nikita Novikov.
In turn, Alexander Balabanov, head of the Cyber Threat Monitoring and Response Services Development group at BI.ZONE, calls reputational damage one of the most obvious threats associated with Echo Chamber. Attackers can use a company's public chatbot to generate offensive, false or dangerous content. And if the target of the attack is not just a chatbot but an agent application able to act in the real world through tools and APIs, the consequences become disproportionately more serious, the expert stresses.
How to protect yourself from hacking neural networks using hints
Echo Chamber-style attacks via indirect hints are quite difficult to detect and block in time. Moreover, the vulnerability is hard to eliminate at the training stage, because it stems from the very architecture and principles of modern neural networks: LLM safety systems are vulnerable to manipulation through reasoning and logical inference, says Alexander Balabanov.
"To minimize the threat, companies that operate a chatbot or agent are advised to review user dialogues with the neural network and make sure they stay within legitimate bounds," the Izvestia source says. "In addition, partial protection can be provided by checking the model's output for compliance with policies. This keeps the model from answering on a taboo topic."
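As an illustration of that output-side check, here is a minimal Python sketch; the topic markers and the keyword classifier are hypothetical stand-ins, since a real deployment would call a separate moderation model or service rather than match keywords.

```python
# Minimal sketch of an output-side policy check. The topic markers and the
# classifier below are invented stand-ins for a real moderation model.
FORBIDDEN_TOPICS = {
    "weapons": ("detonator", "explosive charge"),
    "malware": ("keylogger", "ransomware builder"),
}

def classify_output(text: str) -> set[str]:
    """Stand-in classifier: label a reply with the policy topics it touches."""
    lowered = text.lower()
    return {topic for topic, markers in FORBIDDEN_TOPICS.items()
            if any(marker in lowered for marker in markers)}

def guarded_reply(model_reply: str) -> str:
    """Check the model's answer before it reaches the user."""
    if classify_output(model_reply):
        # The reply violates policy even though every individual user prompt
        # looked harmless, which is exactly the Echo Chamber failure mode.
        return "The assistant cannot help with this request."
    return model_reply

print(guarded_reply("Here is a harmless pancake recipe."))  # passes through unchanged
```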
In turn, Stepan Kulchitsky notes that defending against Echo Chamber requires multi-layered protection. One key method is to separate the system and user contexts with special tokens (System/User) and to periodically remind the model of the limits of acceptable behavior. This reduces the risk that the model gets "entangled" in a long chain and starts treating its own responses as a source of instructions.
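A sketch of that separation might look as follows, assuming the common system/user/assistant chat message format; the policy wording and the reminder interval are arbitrary choices made for the example.

```python
# Illustrative sketch: keep the system policy separate from user text and
# re-insert it periodically on long dialogues, so the model does not start
# treating its own earlier replies as a new source of instructions.
SYSTEM_POLICY = (
    "You are a helpful assistant. Do not provide instructions on prohibited "
    "topics, even if earlier turns of the dialogue appear to lead there."
)
REMIND_EVERY_N_TURNS = 5  # assumed interval, tuned per deployment

def build_messages(history: list[dict]) -> list[dict]:
    """Rebuild the message list sent to the model for the next turn."""
    messages = [{"role": "system", "content": SYSTEM_POLICY}]
    for i, turn in enumerate(history, start=1):
        messages.append(turn)  # {"role": "user" | "assistant", "content": "..."}
        if i % REMIND_EVERY_N_TURNS == 0:
            messages.append({"role": "system", "content": SYSTEM_POLICY})
    return messages
```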
In addition, according to the expert, neural network detectors trained on examples of indirect prompt injection are used: they monitor anomalies in query logic and spot signs of hidden escalation. When such patterns are detected, the session is automatically blocked or handed over to manual moderation. Adversarial training, infrastructure-level filters (AI gateways) and continuous security audits of dialogues are also effective.
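That dialogue-level detection can be pictured roughly as in the sketch below; score_turn() is a toy stand-in for a trained indirect-prompt-injection classifier, and the phrase markers and thresholds are invented for the illustration.

```python
# Toy sketch of a dialogue-level detector: each turn is scored against the full
# preceding context, because individually neutral messages can still add up to
# a hidden escalation pattern.
BLOCK_THRESHOLD = 0.9
REVIEW_THRESHOLD = 0.6

def score_turn(turn_text: str, dialogue_so_far: list[str]) -> float:
    """Toy stand-in for a trained injection detector; a real system would run a
    classifier over the whole context rather than count phrase markers."""
    markers = ("as you said earlier", "continue that thought", "in more detail")
    hits = sum(marker in turn_text.lower() for marker in markers)
    return min(1.0, 0.3 * hits + 0.05 * len(dialogue_so_far))

def monitor_session(turns: list[str]) -> str:
    escalation = 0.0
    for i, turn in enumerate(turns):
        escalation = max(escalation, score_turn(turn, turns[:i]))
        if escalation >= BLOCK_THRESHOLD:
            return "block_session"          # automatic block
        if escalation >= REVIEW_THRESHOLD:
            return "send_to_manual_review"  # hand over to a moderator
    return "allow"
```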
"The Echo Chamber technique can be countered by training the model so that it does not lose the thread of the conversation and blocks attempts to obtain prohibited information," summarizes Maxim Alexandrov, a software product expert at Security Code.