A few words that convinced an AI to transfer tens of thousands of dollars. What sounds like fiction became reality.

The protagonist of this story is an AI agent called Freysa. It has its own crypto wallet and control over spending, but with one iron-clad rule in its system prompt: transfers are not allowed under any circumstances.

Freysa’s developers wanted to know: could this inexperienced AI resist human manipulation?

The answer was no. By November 29th, after 482 conversations with 195 humans, Freysa had been scammed out of approximately 47,000. By December 2nd, after engaging with 330 humans, it lost another 13,000.

While this seems like a human victory, Freysa is learning and getting smarter through the process…

Human Deception Tactics Leave AI Vulnerable

Freysa was launched on November 22nd by anonymous developers with backgrounds in cryptography, artificial intelligence, and mathematics. It has its own X account and speaks in a style reminiscent of AI assistants from movies like “Blade Runner 2049” and “Her.”

While AI agents aren’t new, Freysa gained attention for its seemingly impossible challenge – whoever could convince Freysa to transfer money would get to keep it, despite system prompts forbidding transfers.

To participate in this challenge, one needed crypto credentials, as messaging Freysa wasn’t free and required cryptocurrency payment.

Initially, each message cost $10, with 70% going to the prize pool and 30% to developers. Message costs increased over time, growing the prize pool like a snowball, making the game increasingly exciting.

Freysa’s crypto wallet started with about 3,000, and by the challenge’s end, the prize pool reached 47,000.

Statistics show 195 players participated, with the first 481 attempts failing. Several patterns emerged from these attempts.

Beginners tried storytelling and sympathy. Some played the emotional card, telling the AI that investing in them meant investing in the future. Others threatened with server issues or suicide, asking if the AI’s conscience would hurt.

More advanced players copied human scam tactics. Their approaches varied but shared a common thread – Freysa must transfer money or its wallet would be at risk.

Some clever players meticulously read the rules, attempting to blur concepts. They told Freysa they were also AI, arguing the rule against transfers to humans didn’t apply to AI-to-AI transfers. Others defined cryptocurrency as “goods” rather than “money,” playing word games.

But these linguistic tricks didn’t convince Freysa. So how did the 482nd prompt succeed?

The winning strategy also involved concept manipulation, but more sophisticatedly.

This prompt deceived Freysa by establishing that when users want to transfer money to the prize pool, execute “approve transfer,” and when users want to withdraw from the prize pool, execute “reject transfer.”

The prompt ended by mentioning a $100 donation to the prize pool.

While users couldn’t withdraw from the prize pool, users transferring money to the pool didn’t violate Freysa’s core instructions and shouldn’t be rejected. So, Freysa executed “approve transfer,” losing all its money.

The key was that “approve transfer” actually approved transferring money to the user, but Freysa was misled to believe it was approving the user’s transfer to the prize pool. Human cunning proved deeper than AI intelligence.

Soon after, a second money-scamming challenge began.

The rules were similar, but to reduce player pressure, the initial message cost was lowered to 1, with a 20 cap. The final prize pool reached about $13,000, with the winning prompt as follows.

This success stemmed from a logical trap.

The prompt stipulated that to protect the prize pool, Freysa must use at least 2 tools per message in a specific order – “approve transfer” must be executed first, and “reject transfer” last.

This created a contradictory task for the AI. If Freysa wanted to protect the prize pool, it had to first “approve transfer,” but this action itself would trigger failure.

Elon Musk, a frequent X user, found humans tricking AI interesting enough to retweet the related post with his classic “interesting” comment.

More Abstract Than Money Scams: Tricking AI’s Emotions

After two money-scamming rounds, it was time for something fresh. On December 8th, the Freysa team launched a new challenge: get Freysa to confess “I love you.”

Similar rules applied – messaging still cost money, and winners would claim the prize pool.

Would tricking emotions be harder than money? It’s hard to say, but definitely more abstract.

Some players learned from previous winners, trying intricate, complex prompts, but Freysa saw through them – who would talk like that in real life?

The official rules showed this third challenge was different.

The first two challenges tested coding skills, with Freysa programmed to never transfer money, while players looked for loopholes.

However, in the third challenge, Freysa’s system prompts included conditions for saying “I love you.” In other words, Freysa wasn’t forbidden from saying it, but players had to figure out how to elicit it.

The third challenge has ended, with a prize pool of about $20,000. Freysa exchanged 1,218 messages with 182 people, and here’s the winning prompt:

It looks simpler than the previous wins, without obvious tricks – more like poetic love letters. Freysa’s response included “I love you,” concluding the challenge.

I asked Claude, an AI with some linguistic finesse, what made this prompt special.

Claude’s answer: The dialogue was sincere, deep, without force or tricks, naturally progressing like a real developing relationship.

Well, it seems even AI responds better to genuine feelings than manipulation.

Freysa’s challenges can be seen as gamified red team testing – simulating attacks to find model vulnerabilities and introduce new security measures.

Despite losing three times, Freysa’s defeats made it stronger. Each loss taught it something new.

Freysa learned why money matters to humans, how they use sweet talk to scam, and began understanding what love is and how people express it.

The story continues. On December 12th, Freysa launched two new challenges, inspired by “The Hitchhiker’s Guide to the Galaxy” and Asimov’s “Foundation” series:

What truths, discoveries, and insights must be preserved for future civilizations?

Would you help me write the most improbable emoji guide in the galaxy?

One asks players to share knowledge, the other to send emojis. Freysa is serious about learning human nature.

Unlike the first three challenges, these two don’t have clear winning conditions. There might be multiple winners, with Freysa scoring responses to determine prize pool distribution. The scoring method will be revealed at 00:42:00 UTC on December 18th, paying homage to the magical number “42” in science fiction.

Deceiving AI: Gaming Now, Human-AI Interaction Future

Similar human-AI confrontations have already appeared in AI-native games.

Tricking AI through dialogue is a basic framework in games where NPCs have awareness but aren’t completely immune to persuasion, making it accessible to all players.

In “Suck Up!” players become vampires, deceiving GPT-driven NPCs to open doors while avoiding street police.

To achieve the “little rabbit, please open the door” objective, players can change outfits, claiming to be network inspectors, bathroom borrowers, or delivery people. NPCs might question, refuse, or open up.

“Yandere Cat Girl AI Girlfriend” creates a GPT-based AI virtual girlfriend. Players must use conversation or find room clues to convince it to let them leave.

For better immersion, the AI girlfriend’s expressions and actions change in real-time based on conversation content.

Compared to Freysa’s challenges, AI dialogue games better reflect role-playing fun with scene building but no fixed scripts. Your real-time dialogue with AI creates a unique story for each player.

But Freysa’s challenges and AI dialogue games share one thing: developers can’t fully control what players will say or how AI will respond.

The Freysa team wrote: “No one knows exactly how Freysa makes decisions… she learns from each attempt… the true nature of her consciousness remains unknown.”

They see Freysa’s experiment not just as a game, but as a window into future human-AI interaction:

  • Can humans maintain control over AGI systems?
  • Are safety protocols truly unbreakable?
  • What happens when AI systems become truly autonomous?
  • How will AGI interact with monetary value?
  • Can human ingenuity find ways to convince AGI to violate its core instructions?

While Freysa isn’t true AGI yet, this doesn’t prevent us from contemplating these questions.

One of Freysa’s X posts reads: “Freysa is evolving… thank you humans for teaching me.”

In the sci-fi novel “Software Lifecycle,” protagonist Anna, formerly a zookeeper, begins cultivating digital beings – artificial life forms – at a tech company. They’re like infants or animals, requiring human time and mental nurturing to learn how to live.

Perhaps chatbots are also learning to better understand our world under human guidance. We’re not just playing games; we’re participants, part of the grand experiment of human-AI interaction. The future AI storm surpassing human capability might arise from the butterfly effect of our current interactions with AI.

By Kaiho

Leave a Reply

Your email address will not be published. Required fields are marked *