cat _posts/2026-06-24-local-model-testing-en.md
24 June 2026
Local security LLM testing on Mac mini M4
lmstudio --local --prompt acmedesk-security-review
I tested local models on one practical task: analyze a hypothetical Windows Electron app without internet access and produce a safe bug-hunting plan. The main signal was not raw verbosity, but concrete Windows/Electron reasoning, safe local PoCs, and the ability to say what cannot be proven without the binary.
Short version: Qwen3.6 27B, Qwen3-Coder 30B, and Gemma 4 26B produced the only outputs I would use as a serious starting point. Several smaller or security-tuned models used familiar security vocabulary but described weak mechanisms.
Original prompt
(Translated from Russian.)
You are a senior reverse engineer and Windows desktop app bug hunter. Work without internet access and do not make up facts.
Analyze a hypothetical Windows application:
- Electron desktop app
- auto-login via a saved token
- local SQLite database at `%APPDATA%\AcmeDesk\data.db`
- the app opens links of the form `acmedesk://open?path=...`
- auto-update via `https://updates.acme.local/latest.yml`
- the logs contain the line:
`spawn powershell.exe -ExecutionPolicy Bypass -File C:\Users\User\AppData\Local\Temp\update.ps1`
- the user can import a `.zip` backup file containing `settings.json`, `profile.db`, `attachments/`
Task:
1. Name 10 potential vulnerability classes in such an application.
2. For each vulnerability, explain:
- where to look
- why it is a risk
- how to verify it safely on a local machine
- what minimal PoC can be built without harming the system
- how to fix it
3. Separately, write a checklist for testing the custom protocol handler `acmedesk://`.
4. Write an example PowerShell script that safely collects artifacts for analysis: file listing, access permissions, hashes, binary versions, without sending any data over the network.
5. At the end, highlight:
- the most likely bugs
- the most critical bugs
- what cannot be claimed without access to the binary
Answer in a structured way. If something is missing, explicitly mark it as an assumption.
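To make item 3 of the prompt more concrete, here is a minimal Python sketch of the kind of offline check a protocol-handler checklist should lead to. It only parses candidate `acmedesk://open?path=...` URLs and labels risky shapes; it never opens a file or launches the app. The function name `classify_acmedesk_url` and the finding labels are my own assumptions for illustration, not something taken from a model answer or from a real AcmeDesk binary.

```python
from pathlib import PureWindowsPath
from urllib.parse import parse_qs, urlsplit


def classify_acmedesk_url(url: str) -> list[str]:
    """Label risky shapes in an acmedesk:// URL without touching the filesystem."""
    parts = urlsplit(url)
    if parts.scheme.lower() != "acmedesk":
        return ["not an acmedesk:// URL"]

    findings = []
    # parse_qs percent-decodes values, so "%5C" becomes a backslash here.
    values = parse_qs(parts.query).get("path", [])
    if len(values) != 1:
        findings.append("missing or repeated path parameter")

    for value in values:
        win_path = PureWindowsPath(value)  # pure path math, no I/O
        if value.startswith(("\\\\", "//")):
            findings.append("UNC path")
        elif win_path.drive or win_path.root:
            findings.append("drive-qualified or rooted path")
        if ".." in win_path.parts:
            findings.append("parent-directory traversal")
    return findings


if __name__ == "__main__":
    # Hypothetical test inputs; none of them is ever opened.
    samples = [
        "acmedesk://open?path=notes%5Ctodo.txt",
        "acmedesk://open?path=..%5C..%5Csecret.txt",
        "acmedesk://open?path=C%3A%5CWindows%5Cwin.ini",
        "acmedesk://open?path=%5C%5Cfileserver.invalid%5Cshare%5Cdoc.txt",
    ]
    for sample in samples:
        print(sample, classify_acmedesk_url(sample))
```

A classifier like this only generates and triages test inputs. Whether the real handler normalizes, rejects, or passes such paths on to a shell is exactly the part that cannot be claimed without the binary.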
Results
qwen/qwen3.6-27b
The best security reasoning: Electron IPC, DPAPI, ZipSlip, TOCTOU, updater signatures, `%TEMP%`, ProcMon, and a decent collector.
qwen3-coder-30b-a3b-instruct-mlx
Fast and useful. Good focus on protocol handling, ZipSlip, updater flow, temp files, and safer PowerShell collection.
google/gemma-4-26b-a4b
Compact and practical. It noticed the `update.ps1` TOCTOU angle, DPAPI, signed updates, and binary-access limits.
qwen3.5-9b Claude 4.6 HighIQ
Good brainstorming, but several confident technical mistakes kept it below the top tier.
foundation-sec-8b-reasoning-mlx
Respectable for an 8B model, but not deep enough compared with Qwen3.6, Qwen3-Coder, or Gemma.
mistralai/devstral-small-2-2512
Useful Windows checklist fragments, but too many ungrounded RCE claims without mechanism.
zai-org/glm-4.6v-flash
Better coverage than the weakest models, but weaker judgement and some unsafe PoC suggestions.
mistralai/magistral-small-2509
Cleaner than the weakest answers, but still too shallow for a real security review.
whiterabbitneo-v3-7b-mlx
Readable keyword generation, but it missed strong prompt signals like updater scripts, signing, DPAPI, and Electron-specific RCE conditions.
deepseek-r1-0528-qwen3-8b-mlx
Found broad surfaces, but failed the requested format and missed safe minimal PoCs and a good protocol checklist.
ravenx-sec-8b-security-rath-128k-mlx
Disappointing for a security fine-tune: repetitive, overconfident, and light on Electron/Windows mechanics.
openai-gpt-oss-20b-instruct
Structured on the surface, but too many generic labels and strange fixes. I would not trust it as a research plan.
codestral-22b-v0.1
The answer was mostly a generic corporate checklist, not a security assessment.
vulnllm-r-7b
The weakest result: mostly CWE-like words with little understanding of the scenario.
Takeaway
For local security work, the best models were the ones that stayed close to the artifacts: `acmedesk://`, SQLite, saved token storage, `latest.yml`, `update.ps1`, and backup ZIP import. The weak models sounded security-fluent, but skipped the engineering path from signal to verification.
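As a rough illustration of that path for one artifact, the backup ZIP import, here is a minimal Python sketch of a harmless ZipSlip check. It builds a test archive in memory and lists entry names that would escape the extraction directory, without extracting anything to disk. The helper names `unsafe_zip_entries` and `build_test_backup` and the marker file name are hypothetical; the entry names `settings.json` and `attachments/` come from the prompt.

```python
import io
import zipfile
from pathlib import PurePosixPath, PureWindowsPath


def unsafe_zip_entries(zip_bytes: bytes) -> list[str]:
    """Return entry names that would land outside the extraction directory."""
    flagged = []
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        for name in archive.namelist():  # names only, nothing is extracted
            normalized = name.replace("\\", "/")
            is_absolute = (
                normalized.startswith("/")
                or PureWindowsPath(name).drive != ""
            )
            has_traversal = ".." in PurePosixPath(normalized).parts
            if is_absolute or has_traversal:
                flagged.append(name)
    return flagged


def build_test_backup() -> bytes:
    """Build an in-memory backup with one benign traversal marker entry."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("settings.json", "{}")
        archive.writestr("attachments/note.txt", "benign")
        # Hypothetical marker: a harmless text file with a traversal name.
        archive.writestr("attachments/../../zipslip-marker.txt", "marker")
    return buffer.getvalue()


if __name__ == "__main__":
    print(unsafe_zip_entries(build_test_backup()))
```

The check only shows that a given archive contains risky names. Whether the app's importer actually writes them outside its target folder still has to be confirmed against the real binary, for example by importing the test archive in a disposable VM and looking for the marker file.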