[ kalizi.dev ]

I attacked a ModernBERT with a genetic algorithm. It worked.

Every hacker, at some level, is someone who decided not to take "secure" at face value.

Kevin Mitnick monitored the cell phones of the FBI agents who were trying to catch him, then left a box labeled "FBI Donuts" for them to find when they searched his apartment. Frank Abagnale fooled an entire airline's identity verification at 16. Every system has blind spots. Blind spots are where adversaries live.

For my Master's thesis at the University of Palermo, I decided to be the adversary. Not for a financial institution or a government network. For a state-of-the-art NLP model: a fine-tuned ModernBERT on the IMDB movie reviews dataset.

The result: 100% attack success rate. An average of 18 queries to the model. Zero semantic degradation.

Here's how it works.

The crack in "robust" NLP

Modern language models are genuinely impressive. ModernBERT reaches 93.9% accuracy on IMDB sentiment classification after 3 epochs of fine-tuning. That sounds robust. It isn't.

There's a structural vulnerability that adversarial ML research has been exploiting for years: models learn patterns from training data, not meaning.

Here's a simple illustration. Take a 10-word text where every word has exactly 2 synonyms. Treat synonyms as semantically equivalent, so you can substitute them freely without changing what the sentence means. Each position then has 3 options (the original word or either synonym), which gives 3^10 - 1 = 59,048 semantically equivalent variants of the original text. With more synonyms per word, the count only grows. The model was trained on a large corpus. Almost certainly not on all 59,048 variants.
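The arithmetic is easy to check. A minimal sketch, assuming a fixed number of options per word; the three-word toy sentence and its synonyms are made up for the brute-force confirmation:

```python
from itertools import product

def count_variants(options_per_word):
    """Number of distinct rewrites, excluding the unmodified original."""
    total = 1
    for n_options in options_per_word:
        total *= n_options
    return total - 1

# 10 words, each with 3 options: the original plus 2 synonyms.
ten_word_count = count_variants([3] * 10)
print(ten_word_count)  # 59048

# Brute-force confirmation on a 3-word toy sentence (hypothetical synonyms).
toy = [
    ["good", "fine", "nice"],
    ["film", "movie", "picture"],
    ["overall", "altogether", "generally"],
]
variants = [" ".join(words) for words in product(*toy)]
print(len(variants) - 1)  # 26
```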

Some words are frequent enough to be well-represented. Their synonyms may not be. A word and its synonym can land in completely different regions of the model's internal representation. Even though they mean the same thing.

This is the core issue: semantic equivalence doesn't guarantee classifier equivalence. The question is how you find the right substitution efficiently, automatically, and without turning the text into something nobody would actually write.

Evolution as an attack strategy

This is where genetic algorithms come in.

Genetic algorithms are search methods inspired by biological evolution. You maintain a population of candidate solutions (called chromosomes), evaluate each against a fitness function, then iterate through two operations:

Mutation: small random changes to a candidate
Crossover: combining characteristics of two parent candidates

Over generations, the population converges toward better solutions. In this case, texts that preserve meaning but flip the model's classification from positive to negative (or vice versa).
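To make the loop concrete, here is a minimal sketch of a genetic search over word substitutions. It is not the thesis implementation: the synonym table, the word weights, and the toy scorer standing in for the victim model are all hypothetical, and the selection scheme (keep the fittest half) is just one common choice.

```python
import random

# Hypothetical synonym table and word weights, for illustration only.
SYNONYMS = {
    "loved": ["liked", "enjoyed"],
    "great": ["fine", "decent"],
    "film": ["movie", "picture"],
}
POSITIVE_WEIGHT = {
    "loved": 0.4, "liked": 0.2, "enjoyed": 0.3,
    "great": 0.5, "fine": 0.1, "decent": 0.05,
    "film": 0.05, "movie": 0.05, "picture": 0.0,
}

def positive_score(words):
    """Toy stand-in for the victim model's positive-class probability."""
    return min(1.0, sum(POSITIVE_WEIGHT.get(w, 0.0) for w in words))

def fitness(words):
    """Higher is better for the attacker: a lower positive score."""
    return 1.0 - positive_score(words)

def evolve(original, generations=20, pop_size=8, seed=0):
    rng = random.Random(seed)
    slots = [i for i, w in enumerate(original) if w in SYNONYMS]

    def mutate(words):
        # Mutation: replace one random eligible word with a synonym of the original.
        child = list(words)
        i = rng.choice(slots)
        child[i] = rng.choice(SYNONYMS[original[i]])
        return child

    def crossover(a, b):
        # Crossover: prefix from one parent, suffix from the other.
        cut = rng.randrange(1, len(a))
        return a[:cut] + b[cut:]

    population = [mutate(original) for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        parents = population[: pop_size // 2]  # keep the fittest half
        children = [
            mutate(crossover(rng.choice(parents), rng.choice(parents)))
            for _ in range(pop_size - len(parents))
        ]
        population = parents + children
    return max(population, key=fitness)

original = "i loved this great film".split()
best = evolve(original)
print(" ".join(best), round(fitness(best), 2))
```

In a real attack the fitness call is a query to the model, which is why the number of evaluations per generation matters so much.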

The core of my implementation operates on bigrams: pairs of adjacent words, not individual tokens. This is a deliberate choice. Bigrams explore the search space faster than single-token substitutions and better reflect how language actually shifts: context changes in chunks, not in isolated words.
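A bigram mutation can be sketched in a few lines. This is an illustration under assumptions, not the thesis code: the function names and the synonym table are hypothetical, and the rule "replace every word of the chosen pair that has a synonym" is my simplification.

```python
import random

def bigrams(words):
    """Return (start_index, (w1, w2)) for every pair of adjacent words."""
    return [(i, (words[i], words[i + 1])) for i in range(len(words) - 1)]

def mutate_bigram(words, synonyms, rng):
    """Pick one random bigram and replace each of its words that has a synonym."""
    child = list(words)
    candidates = [i for i, pair in bigrams(words) if any(w in synonyms for w in pair)]
    if not candidates:
        return child
    start = rng.choice(candidates)
    for i in (start, start + 1):
        if child[i] in synonyms:
            child[i] = rng.choice(synonyms[child[i]])
    return child

# Hypothetical synonym table for the example.
synonyms = {"truly": ["really"], "wonderful": ["marvelous"], "acting": ["performance"]}
words = "the acting was truly wonderful".split()
mutated = mutate_bigram(words, synonyms, random.Random(0))
print(" ".join(mutated))
```

A single mutation touches at most two adjacent positions, so each step moves further through the search space than a one-word swap while staying local.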

The hard constraint: semantic honesty

Any adversarial example that changes the meaning of the original text isn't demonstrating a model vulnerability. It's just generating noise. The attack only means something if the perturbation is genuinely semantically equivalent.

Frozen vocabulary. Not everything is eligible for substitution. Proper nouns, numbers, punctuation, and unrecognized tokens are locked. The attack only operates on content words that can actually be replaced without altering what the text says.
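The locking step amounts to a mask over tagged tokens. A minimal sketch, assuming the tokens already carry Universal POS labels (the same label set spaCy exposes as token.pos_); which classes count as "content words" is my assumption, and the tagged sentence is hand-written rather than produced by a tagger:

```python
# Universal POS labels that are never touched, and the content classes that may be.
FROZEN_POS = {"PROPN", "NUM", "PUNCT", "X"}
CONTENT_POS = {"NOUN", "VERB", "ADJ", "ADV"}  # assumed definition of "content words"

def replaceable_positions(tagged_tokens):
    """Indices of tokens eligible for synonym substitution."""
    return [i for i, (_, pos) in enumerate(tagged_tokens) if pos in CONTENT_POS]

def frozen_positions(tagged_tokens):
    """Indices of tokens that are locked: proper nouns, numbers, punctuation, unknowns."""
    return [i for i, (_, pos) in enumerate(tagged_tokens) if pos in FROZEN_POS]

# Hand-tagged example sentence.
tagged = [
    ("Nolan", "PROPN"), ("directed", "VERB"), ("2", "NUM"),
    ("brilliant", "ADJ"), ("films", "NOUN"), (".", "PUNCT"),
]
print(replaceable_positions(tagged))  # [1, 3, 4]
print(frozen_positions(tagged))       # [0, 2, 5]
```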

Syntactic and morphological consistency. Every candidate substitution goes through POS-tagging via spaCy. If the original word is a plural noun, the replacement must be too. If it's a verb in the third person singular present, the synonym gets inflected to match, using pyinflect on top of spaCy's morphological analysis. This is where naive synonym-replacement approaches collapse: they ignore that "play" and "plays" are syntactically distinct even though they're the same verb.

Synonyms are sourced from WordNet: not just a list, but a graph of semantic relationships with typed edges (hypernymy, hyponymy, antonymy, entailment). This matters because proximity in embedding space doesn't mean semantic similarity. Word2Vec is famous for putting antonyms near each other. WordNet doesn't have that problem.
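The value of typed relations shows up even in a toy graph. In WordNet itself, synonyms are lemmas that share a synset and relations such as antonymy are separate typed links; the sketch below flattens that into a small hand-written edge list (all data hypothetical) to show why a typed lookup never returns an antonym as a substitution candidate:

```python
# Toy lexical graph with typed edges, in the spirit of WordNet (hypothetical data).
EDGES = [
    ("good", "synonym", "fine"),
    ("good", "synonym", "solid"),
    ("good", "antonym", "bad"),
    ("good", "hypernym", "quality"),
    ("movie", "synonym", "film"),
    ("movie", "hypernym", "show"),
]

def neighbors(word, relation):
    """All words reachable from `word` over edges of the given type."""
    return [dst for src, rel, dst in EDGES if src == word and rel == relation]

def substitution_candidates(word):
    """Only synonym edges yield candidates; antonyms and hypernyms are excluded."""
    return neighbors(word, "synonym")

print(substitution_candidates("good"))  # ['fine', 'solid']
print(neighbors("good", "antonym"))     # ['bad']
```

A nearest-neighbor lookup in an embedding space has no such edge types to filter on, which is how "good" and "bad" can end up as candidates for each other.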

Semantic similarity checking. After each mutation, a paraphrase-MiniLM-L6-v2 model measures cosine similarity between the original and perturbed text. Drop below the configured threshold: perturbation rejected. This catches the subtle drift that accumulates when you substitute multiple words. Individually valid synonyms can compound into something semantically wrong. The model was specifically trained on paraphrase pairs, which makes it particularly well-suited for this task.
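The gate itself is a cosine similarity and a comparison. A minimal sketch, assuming the sentence embeddings have already been computed: the vectors are toy values rather than real MiniLM output, and the 0.9 threshold is a placeholder, since the post only says the threshold is configurable.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def accept_perturbation(original_vec, perturbed_vec, threshold=0.9):
    """Keep a perturbed text only if its embedding stays close to the original."""
    return cosine_similarity(original_vec, perturbed_vec) >= threshold

# Toy embeddings (hypothetical values).
original_vec = [1.0, 0.0, 1.0]
close_vec = [0.9, 0.1, 1.0]    # small drift
drifted_vec = [0.0, 1.0, 0.0]  # meaning has moved away

print(accept_perturbation(original_vec, close_vec))    # True
print(accept_perturbation(original_vec, drifted_vec))  # False
```

Comparing against the original text every time, rather than against the previous mutation, is what stops small drifts from compounding.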

Hierarchical sentence targeting. For multi-sentence inputs, the attack doesn't spray modifications across the whole document. It scores each sentence by its individual contribution to the classification probability, then concentrates mutations on the most impactful sentence first. This is key to the efficiency numbers.
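One way to score a sentence's contribution is leave-one-out: remove it and see how far the predicted probability moves. That is an assumption on my part about how the ranking could be done, and the classifier below is a toy stand-in, but it shows the shape of the idea:

```python
# Hypothetical word weights for a toy positive-sentiment scorer.
POSITIVE_WORDS = {"wonderful": 0.4, "enjoyed": 0.2}

def toy_predict_proba(text):
    """Stand-in for the model's positive-class probability."""
    words = text.lower().replace(".", "").split()
    return min(1.0, 0.3 + sum(POSITIVE_WORDS.get(w, 0.0) for w in words))

def rank_sentences(sentences, predict_proba):
    """Order sentence indices by how much removing each one lowers the score."""
    base = predict_proba(" ".join(sentences))
    impact = []
    for i in range(len(sentences)):
        reduced = " ".join(s for j, s in enumerate(sentences) if j != i)
        impact.append((base - predict_proba(reduced), i))
    return [i for _, i in sorted(impact, reverse=True)]

sentences = ["The plot was simple.", "The cast was wonderful.", "I enjoyed it."]
order = rank_sentences(sentences, toy_predict_proba)
print(order)  # [1, 2, 0]
```

The ranking costs one query per sentence plus one for the full text, which is cheap next to mutating every sentence blindly.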

The results

I benchmarked against 9 state-of-the-art methods from the TextAttack framework: BAE, Checklist, DeepWordBug, Faster Alzantot, Input Reduction, Pruthi, PWWS, TextBugger, and TextFooler.

The composite Global Score weights attack success rate (40%), query efficiency (20%), and semantic preservation (40%).
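As a formula, that is a plain weighted sum. A minimal sketch, assuming each component has already been normalized to a 0-100 scale; how query counts are mapped onto that scale is not described here, so the inputs in the example are hypothetical and not taken from the benchmark:

```python
WEIGHTS = {
    "success_rate": 0.40,
    "query_efficiency": 0.20,
    "semantic_preservation": 0.40,
}

def global_score(success_rate, query_efficiency, semantic_preservation):
    """Weighted sum of three sub-scores, each assumed to be on a 0-100 scale."""
    return (
        WEIGHTS["success_rate"] * success_rate
        + WEIGHTS["query_efficiency"] * query_efficiency
        + WEIGHTS["semantic_preservation"] * semantic_preservation
    )

# Hypothetical sub-scores: perfect success and preservation, middling efficiency.
example = global_score(100.0, 50.0, 100.0)
print(round(example, 1))
```

Because success rate and semantic preservation carry 80% of the weight between them, an attack cannot buy a high score with efficiency alone.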

Attack	Global Score	Success Rate	Avg. Queries
custom (this work)	97.2	100%	17.9
Pruthi	89.8	89.4%	3,242
Checklist	86.9	93.8%	260
BAE	66.2	-	324
Faster Alzantot	-	-	10,677

The efficiency gap is the number that matters most. Faster Alzantot is also a genetic algorithm. It needed 10,677 queries on average. This implementation needed 18. The difference comes from three things: hierarchical sentence selection, random bigram sampling instead of scoring every candidate, and early stopping the moment the label flips.
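Two of those three ingredients, random sampling and early stopping, fit in a short sketch along with the query counter that produces the efficiency number. This is a plain random search rather than the full genetic algorithm, and the classifier, synonym table, and query budget are all hypothetical:

```python
import random

class CountingOracle:
    """Wraps a black-box label function and counts how many times it is called."""

    def __init__(self, predict_label):
        self._predict_label = predict_label
        self.queries = 0

    def __call__(self, text):
        self.queries += 1
        return self._predict_label(text)

def toy_label(text):
    """Stand-in classifier: positive if any strongly positive word is present."""
    return "positive" if {"loved", "great"} & set(text.split()) else "negative"

def attack(words, synonyms, oracle, max_queries=100, seed=0):
    rng = random.Random(seed)
    original_label = oracle(" ".join(words))
    slots = [i for i, w in enumerate(words) if w in synonyms]
    current = list(words)
    while slots and oracle.queries < max_queries:
        i = rng.choice(slots)  # random sampling: no exhaustive candidate scoring
        current[i] = rng.choice(synonyms[words[i]])
        if oracle(" ".join(current)) != original_label:
            return current, oracle.queries  # early stop: the label flipped
    return None, oracle.queries

synonyms = {"loved": ["liked"], "great": ["decent"], "film": ["movie"]}
oracle = CountingOracle(toy_label)
adversarial, used = attack("i loved this great film".split(), synonyms, oracle)
print(adversarial, used)
```

The only thing the attack ever sees is the label the oracle returns, so every call counts against the budget and nothing after the flip is wasted.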

The 100% success rate against a 93.9%-accurate model is significant. Most attacks that predate ModernBERT showed high failure rates against it. ModernBERT's larger context window (8,192 tokens vs. BERT's 512) and updated architecture genuinely improved robustness. It held up against most of the competition. It didn't hold up against this.

Everything runs on consumer hardware: Ryzen 9 5900X, RTX 3080, 128 GB RAM. No datacenter, no API credits, no special access.

What this actually reveals

The obvious reading: even state-of-the-art models have exploitable blind spots. That's true, and it's not trivial.

The deeper reading: when a text and its semantically equivalent adversarial version receive different classifications, the model is responding to surface-level word distributions learned from training data. Not to meaning. The semantic content is the same. The prediction is not.

This is a structural property, not a bug. And it has direct consequences for any production NLP deployment where an adversary has an incentive to flip a classification:

Sentiment monitoring for market or political intelligence
Content moderation systems
Medical or legal text classification
Recommendation systems that rely on automated text quality assessment

The attack works in black-box mode: no access to weights, no gradients, just query access. An API endpoint is enough. The adversary doesn't need to know anything about the model internals.

Open problems

Three directions worth exploring.

Narrow domains first. A model trained exclusively on legal or medical text is probably more fragile, not less. The vocabulary is smaller, the patterns more predictable, the adversarial search space more constrained. The attack would likely need even fewer queries.

LLMs are the harder problem. The adversarial ML literature is increasingly focused there: jailbreaks, prompt injection, sleeper agents that activate only under specific trigger conditions. The evolutionary search approach generalizes in principle. The practical challenge is query cost and the stochastic nature of LLM output, which makes fitness evaluation inconsistent. Worth exploring.

Defenses are the open question. The obvious response is adversarial training: include adversarial examples in the training set. But adversarial training on one attack method doesn't generalize well to others, and the attack evolves. The arms race is ongoing.

Bruce Schneier put it well: "If you think technology can solve your security problems, then you don't understand the problems and you don't understand the technology."

A 93.9%-accurate model is genuinely good. It's also a system with structural vulnerabilities that can be located and exploited without any access to its internals. Understanding those vulnerabilities is the prerequisite for building systems that are actually reliable. Not just accurate on the test set.

The adversary always knows exactly where to look. Building the defense requires understanding the attack first.

I built the attack. The defense is the open problem.

The full thesis is in Italian and available on request. The implementation runs on Python 3.9-3.11, built on spaCy, WordNet (via NLTK), HuggingFace Transformers, BitsAndBytes, and pyinflect.