Local & Cloud Specs

Avelyn AI — macOS Multi-Provider & API Key Integration

Run optimized local language models or securely connect to cloud APIs with seamless routing.

Configuring Your Local Ollama Assistant on macOS

To run Avelyn as a fully integrated **local Ollama assistant**, you need to ensure the Ollama system daemon is running on your machine. Ollama acts as the backend inference pipeline, packaging open-source model weights (like Meta's Llama 3 or Google's Gemma 3) and exposing a local REST API endpoint on your loopback address.

Follow these simple steps to initialize your environment:

Download Ollama for macOS from the official portal and move it to your `/Applications` directory.
Launch your Terminal app and fetch your preferred model. For instance, run: ollama run gemma3:4b.
Confirm the daemon is listening by visiting http://localhost:11434 in your browser.
Open the assistant settings window and choose your active model from the local model profiles list.

By linking the menu bar assistant to Ollama, you achieve instant, zero-latency text refinements system-wide. Learn more about the core mechanics on our About Avelyn Page or review common setups on the What is Avelyn FAQ Page.

Integrating Cloud APIs & OpenRouter Keys

If you prefer not to download and compile large model weights locally, or want to access global frontier models, Avelyn supports direct cloud provider integrations. You can connect your custom API keys for OpenRouter (Avelyn Cloud) or standard OpenAI-compatible endpoints.

To integrate your cloud API keys:

Open the Settings panel and navigate to the AI Provider tab.
Select the Single Provider mode or configure overrides inside the Smart Router.
Under **Avelyn Cloud**, paste your OpenRouter API Key (prefixed with sk-or-v1-...). Your key will be permanently masked with password bullet points (••••••••) for strict UI privacy.
Choose a model from the list of options, such as google/gemini-2.5-flash or google/gemini-2.5-flash.
To connect custom local endpoints (like LM Studio or vLLM) or raw OpenAI servers, choose the Custom API provider option and fill in your base URL, API Key, and target model ID.

API keys are stored strictly in local configuration files. Avelyn does not route keys through third-party servers, ensuring complete telemetry privacy.

Apple Silicon Benchmark Logs & Generation Speed

Performance scales directly with your chip's memory bandwidth. Because Apple M-series chips use unified memory, local inference does not suffer from CPU-to-GPU memory transfer overheads. A 4B parameter model like Gemma 3 Warm-boots instantly and generates tokens at speeds exceeding 70 tokens per second.

Here is a breakdown of average generation speeds across hardware setups:

Apple M4 Pro (Unified Memory): Gemma 3 (4B) → ~85 tokens/sec. Llama 3 (8B) → ~52 tokens/sec.
Apple M2 Max (Unified Memory): Gemma 3 (4B) → ~68 tokens/sec. Llama 3 (8B) → ~42 tokens/sec.
Apple M1 (Standard 8GB): Gemma 3 (4B) → ~35 tokens/sec. Llama 3 (8B) → ~20 tokens/sec.

This hardware-level efficiency matches or exceeds cloud generation latency, without routing text blocks over the internet. Read our comparison write-ups: Avelyn vs ChatGPT and Avelyn vs Grammarly.

Custom System Prompts & Context Length Optimization

The assistant allows you to write custom instructions to adjust the editor's behavior. By passing system prompts, you can instruct the local model to write in specific styles (e.g. academic, professional email, clean markdown format, or translation).

To avoid slowing down local performance, we recommend setting a modest context window of 2,048 or 4,096 tokens inside the system settings panel. This allocates sufficient memory for system-wide highlight rewrites without exhausting system RAM or causing chip thermal throttling.

Additionally, you can define target formatting instructions (such as "keep formatting exactly intact" or "output raw code block syntax only") in the prompt template. The system parses these instructions alongside user selection, formatting outputs instantly without adding unnecessary tokens to the generation path.

Troubleshooting & GPU Layer Allocation Limits

When running the assistant alongside resource-intensive software like video editors or compiling large projects, memory pressure can force the OS to page local model weights out of GPU memory. This shifts inference back to the CPU, increasing latency.

To resolve this, users can set custom parameters in the settings panel to lock model parameters in physical RAM (using `mlock`). We also recommend adjusting the GPU thread layer allocation count in Ollama's configuration file to reserve 2GB of unified VRAM for OS system processes, maintaining smooth generation speeds even under heavy CPU loads.

Local Processing Engine Capabilities

Apple Silicon Optimization

Avelyn utilizes the shared unified memory architectures of Apple M1, M2, M3, and M4 chips to execute LLM inferences without GPU/CPU context switches, giving you blazing-fast speed.

Sub-Second Prefills & Token Streaming

By using dynamic loading and caching strategies, local model warming is executed in the background. The rewritten text streams back token-by-token directly inside a minimalist window.

Zero Network Latency & Dependability

No api key quotas, no rate limits, and no server crashes. Avelyn works identical in remote mountain cabins, on flights, or in high-security corporate local areas.

Full Custom Model Support

Avelyn talks to Ollama via standard local ports. You can change your active model to any custom fine-tuned model (e.g. customized coding helpers) inside the UI panel.

Supported Offline & Cloud Model Profiles

Gemini 2.5 Flash

Google's ultra-low-latency model. Processes system-wide highlight requests in ~1 second (TTFT). Default cloud provider for speed and quality.

Cloud / Speed / Default

Gemini 2.5 Flash

Fast, extremely cheap, and highly capable cloud model. Ideal for complex reasoning tasks and large document summarizations.

Cloud / Reasoning

Claude 3.5 Haiku

Anthropic's low-latency model, famous for coding task completions, technical writing, and structural email updates.

Cloud / Writing

Gemma 3 (4B / 9B)

Developed by Google, this lightweight model is optimized for local execution. Optimized for speed and quality in writing edits.

Local / Offline

Llama 3 (8B)

Meta's highly popular open weights model. Excellent for general structural updates, semantic rephrasing, and creative copy editing.

Local / Prose

Mistral (7B)

Known for its rich linguistic ability, it serves as a powerful offline model for translation, formatting, and complex structural grammar checks.

Local / Grammar