Local Intelligence Specifications

Avelyn AI — macOS Local Ollama Assistant Integration

Run optimized, high-performance generative models directly on your hardware with absolute privacy.

Configuring Your Local Ollama Assistant on macOS

To run Avelyn as a fully integrated **local Ollama assistant**, you need to ensure the Ollama system daemon is running on your machine. Ollama acts as the backend inference pipeline, packaging open-source model weights (like Meta's Llama 3 or Google's Gemma 3) and exposing a local REST API endpoint on your loopback address.

Follow these simple steps to initialize your environment:

  1. Download Ollama for macOS from the official portal and move it to your `/Applications` directory.
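  2. Launch your Terminal app and fetch your preferred model. For instance, run `ollama run gemma3:4b`, which downloads the model on first use and then opens an interactive session.
  3. Confirm the daemon is listening by visiting `http://localhost:11434` in your browser. The page should report that Ollama is running.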
  4. Open the assistant settings window and choose your active model from the local model profiles list.
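The check in step 3 can also be done from a script. The following is a minimal sketch, assuming Ollama is listening on its default address and using its `/api/tags` endpoint, which lists locally installed models. The function names are illustrative and are not part of Avelyn.

```python
import json
import urllib.request

# Default loopback address from step 3; change it if your daemon runs elsewhere.
OLLAMA_URL = "http://localhost:11434"


def parse_model_names(tags_body: str) -> list[str]:
    """Extract model names from the JSON body returned by GET /api/tags."""
    data = json.loads(tags_body)
    return [model["name"] for model in data.get("models", [])]


def list_local_models(base_url: str = OLLAMA_URL, timeout: float = 2.0) -> list[str]:
    """Ask the local Ollama daemon which models are installed.

    Raises OSError (including urllib.error.URLError) if the daemon is unreachable.
    """
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
        return parse_model_names(resp.read().decode("utf-8"))
```

Calling `list_local_models()` after step 2 should return a list that includes `gemma3:4b`. An `OSError` usually means the daemon is not running.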

By linking the menu bar assistant to Ollama, you get fast, system-wide text refinements with no network round trip. Learn more about the core mechanics on our About Avelyn Page or review common setups on the What is Avelyn FAQ Page.

Apple Silicon Benchmark Logs & Generation Speed

Performance scales largely with your chip's memory bandwidth. Because Apple M-series chips use unified memory, local inference does not suffer from CPU-to-GPU memory transfer overheads. Once warmed up, a 4B parameter model like Gemma 3 starts responding almost immediately and can exceed 70 tokens per second on higher-end chips such as the M4 Pro.

Here is a breakdown of average generation speeds across hardware setups:

  • Apple M4 Pro (Unified Memory): Gemma 3 (4B) → ~85 tokens/sec. Llama 3 (8B) → ~52 tokens/sec.
  • Apple M2 Max (Unified Memory): Gemma 3 (4B) → ~68 tokens/sec. Llama 3 (8B) → ~42 tokens/sec.
  • Apple M1 (Standard 8GB): Gemma 3 (4B) → ~35 tokens/sec. Llama 3 (8B) → ~20 tokens/sec.
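To compare your own machine against these figures, you can compute tokens per second from the timing fields that Ollama includes in a completed generation response. This is a small sketch, assuming the documented `eval_count` (tokens generated) and `eval_duration` (nanoseconds) fields. The function name is illustrative.

```python
def tokens_per_second(response: dict) -> float:
    """Compute generation speed from a finished Ollama response.

    eval_count is the number of generated tokens and eval_duration is the
    time spent generating them, in nanoseconds.
    """
    eval_count = response["eval_count"]
    eval_duration_ns = response["eval_duration"]
    if eval_duration_ns <= 0:
        raise ValueError("eval_duration must be positive")
    return eval_count * 1e9 / eval_duration_ns
```

For example, a response reporting 170 tokens generated in 2 seconds (2,000,000,000 ns) works out to 85 tokens per second.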

This hardware-level efficiency can match or beat the response latency of cloud services, and your text never leaves your machine. Read our comparison write-ups: Avelyn vs ChatGPT and Avelyn vs Grammarly.

Custom System Prompts & Context Length Optimization

The assistant allows you to write custom instructions to adjust the editor's behavior. By passing system prompts, you can instruct the local model to write in specific styles (e.g. academic, professional email, clean markdown format, or translation).

To avoid slowing down local performance, we recommend setting a modest context window of 2,048 or 4,096 tokens inside the system settings panel. This allocates sufficient memory for system-wide highlight rewrites without exhausting system RAM or causing chip thermal throttling.
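As an illustration of how a system prompt and a context window combine in a single request, here is a sketch of a payload for Ollama's `/api/generate` endpoint. The `system` field and the `num_ctx` option are Ollama request fields. The function name is hypothetical, and it is an assumption that Avelyn builds its requests this way.

```python
def build_generate_request(
    model: str,
    selected_text: str,
    system_prompt: str,
    num_ctx: int = 4096,
) -> dict:
    """Build a JSON-serializable body for POST /api/generate."""
    if num_ctx <= 0:
        raise ValueError("num_ctx must be a positive token count")
    return {
        "model": model,
        "prompt": selected_text,          # the highlighted text to rewrite
        "system": system_prompt,          # custom style instructions
        "stream": False,                  # return one complete response
        "options": {"num_ctx": num_ctx},  # context window, in tokens
    }
```

Passing `num_ctx=2048` keeps memory use lower for short highlight rewrites, while 4,096 leaves more room for longer selections.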

Additionally, you can define target formatting instructions (such as "keep formatting exactly intact" or "output raw code block syntax only") in the prompt template. The system sends these instructions alongside the user's selection, so the model returns output in the requested format without spending tokens on extra commentary.
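A prompt template of this kind can be sketched in a few lines. The template wording and the names below are purely illustrative and do not reflect Avelyn's actual internal format.

```python
from string import Template

# Hypothetical template: formatting instructions first, then the user's selection.
PROMPT_TEMPLATE = Template("Instructions: $instructions\n\nText:\n$selection")


def render_prompt(instructions: list[str], selection: str) -> str:
    """Join the non-empty formatting instructions and attach the selected text."""
    joined = "; ".join(item.strip() for item in instructions if item.strip())
    return PROMPT_TEMPLATE.substitute(instructions=joined, selection=selection)
```

Keeping the instructions short and placing them ahead of the selection is one simple way to keep the prompt compact.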

Troubleshooting & GPU Layer Allocation Limits

When you run the assistant alongside resource-intensive work such as video editing or compiling large projects, memory pressure can force the OS to page local model weights out of GPU memory. This shifts inference back to the CPU, increasing latency.

To resolve this, you can set custom parameters in the settings panel to lock model weights in physical RAM (using `mlock`). We also recommend lowering the number of model layers that Ollama offloads to the GPU in its configuration, so that about 2GB of unified memory stays free for macOS system processes. This helps maintain smooth generation speeds even under heavy system load.

Local Processing Engine Capabilities

Apple Silicon Optimization

Avelyn uses the unified memory architecture of Apple M1, M2, M3, and M4 chips to run LLM inference without copying data between separate CPU and GPU memory, giving you blazing-fast speed.

Sub-Second Prefills & Token Streaming

Avelyn warms the local model in the background using dynamic loading and caching strategies. The rewritten text streams back token-by-token directly inside a minimalist window.
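Token streaming can be illustrated with Ollama's streaming response format, which sends one JSON object per line, each with a `response` text fragment and a `done` flag. This is a minimal sketch of reading such a stream. The function name is illustrative, and Avelyn's own implementation may differ.

```python
import json
from typing import Iterable, Iterator


def iter_stream_text(lines: Iterable[bytes]) -> Iterator[str]:
    """Yield text fragments from a newline-delimited JSON stream.

    Each line is expected to be a JSON object with a "response" fragment
    and a boolean "done" flag that marks the end of generation.
    """
    for raw in lines:
        raw = raw.strip()
        if not raw:
            continue  # skip blank keep-alive lines
        chunk = json.loads(raw)
        if chunk.get("response"):
            yield chunk["response"]
        if chunk.get("done"):
            break
```

An HTTP response object from `urllib.request.urlopen` can be passed in directly, since iterating over it yields lines as bytes. Each yielded fragment can then be appended to the window as it arrives.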

Zero Network Latency & Dependability

No API key quotas, no rate limits, and no server outages. Avelyn works identically in remote mountain cabins, on flights, or on high-security corporate networks.

Full Custom Model Support

Avelyn talks to Ollama over its standard local port (11434 by default). You can change your active model to any custom fine-tuned model (e.g. customized coding helpers) inside the UI panel.

Supported Offline Model Profiles

Gemma 3 (4B / 12B)

Developed by Google, this lightweight model is the default recommendation for Avelyn. It is optimized for speed and quality in writing edits, typically completing them in less than 1.5 seconds.

Default / Speed

Llama 3 (8B)

Meta's highly popular open weights model. Excellent for general structural updates, semantic rephrasing, and creative copy editing.

Prose / Structure

Mistral (7B)

Known for its rich linguistic ability, it serves as a powerful offline model for translation, formatting, and complex structural grammar checks.

Grammar / Formatting

CodeGemma (2B)

Designed specifically for code completion and debugging. Integrates with IDEs via Avelyn to analyze scripts offline.

Code Support