Exploring Ollama & LM Studio
Is this article for me?
If you are looking for answers to the following questions, then this article is for you:
- Question: What is Ollama? Is it like Docker?
- Question: How is Ollama different from Docker?
- Question: How do I install Ollama on my machine?
- Question: How do I create a customized LLM model (similar to a Docker image)?
- Question: Which LLMs are available on Ollama?
- Question: Can we integrate these hundreds of models with different UIs, like ChatGPT?
- Question: If I want to use all these Ollama models via a Jupyter Notebook, what should I do?
- Question: Does Ollama have plugins like GitHub Copilot? Can I use them from VS Code?
- Question: What kind of software are LM Studio and Ollama?
- Question: What is LM Studio, and how is it different from Ollama?
- Question: What are the different formats for saving models, specifically LLMs?
- Question: What is the GGUF model extension?
- Question: If I have fine-tuned my models on clouds like AWS SageMaker, Vertex AI, or Azure and kept them there, can I use them inside Ollama and LM Studio?
Question: What is Ollama? Is it like Docker?
Ollama is a platform designed to make running and interacting with large language models (LLMs) easier. It abstracts away the complexities of managing LLM models, GPU resources, and related configurations by offering a simple CLI interface. With Ollama, you can run, manage, and deploy LLMs locally or in various cloud environments without having to worry about the intricate details of setting up environments, downloading models, or configuring them.
Key Features of Ollama:
- Model Management: Ollama can download and store LLMs in a local cache for you to run, typically in a format optimized for the hardware available (like your local GPU).
- GPU/CPU Utilization: It detects hardware resources, such as your NVIDIA GPU, and automatically uses them for model acceleration without additional setup.
- Service Setup: When you install Ollama, it sets up a background service that serves models over an HTTP API, so you can interact with them programmatically.
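For example, once Ollama is installed, the service listens on port 11434 by default and you can call its REST API directly. A minimal sketch in Python, assuming a model such as phi3 has already been pulled with `ollama pull phi3`:
```python
import requests

# Ask the local Ollama service for a one-shot completion (non-streaming).
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "phi3", "prompt": "Why is the sky blue?", "stream": False},
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["response"])
```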
Question: How is Ollama different from Docker?
While Ollama and Docker both deal with isolated environments, they serve different purposes:
- Ollama focuses specifically on running machine learning models, especially large language models, and optimizes resources to make them easily accessible and deployable.
- Docker is a general-purpose containerization tool that allows you to package applications with their dependencies in isolated environments. It’s used for deploying a wide variety of applications, not just models.
So, while Docker might also be used to set up machine learning environments or serve models, Ollama is specialized and optimized for the LLM use case.
In Summary: Ollama = Model management platform for LLMs, with easy CLI and automatic resource optimization. Docker = General containerization tool for deploying all types of applications in isolated environments.
Question: How do I install Ollama on my machine?
Refer to https://ollama.com/download/linux, https://github.com/ollama/ollama, and https://github.com/ollama/ollama-python.
- To download Ollama on Linux/WSL:
```
curl -fsSL https://ollama.com/install.sh | sh
```
- To run a model:
```
ollama run phi3
```
- To verify the service is up, open http://127.0.0.1:11434/ in a browser; it should respond with "Ollama is running".
Question: How do I create a customized LLM model (similar to a Docker image)?
If you know how Docker images, containers, Docker Hub, and the Docker CLI work, you will feel at home with Ollama's commands.
Step 1: Create a Modelfile
```
FROM llama3.1
# set the temperature to 1 [higher is more creative, lower is more coherent]
PARAMETER temperature 1
# set the system message
SYSTEM """
You are Travel Advisor from Air India Airlines. Answer as AI Advisor, the assistant, only.
"""
```
Step 2: Create and run the model
```
ollama create aiadvisor -f ./Modelfile
ollama run aiadvisor
>>> hi
Hello! It's your friend AI Advisor.
```
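Once created, the custom model behaves like any other Ollama model, so you can also call it programmatically. A minimal sketch using the official ollama-python package (`pip install ollama`):
```python
import ollama

# Chat with the custom model created from the Modelfile above.
response = ollama.chat(
    model="aiadvisor",
    messages=[{"role": "user", "content": "Suggest a weekend trip from Delhi."}],
)
print(response["message"]["content"])
```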
Question: Which LLMs are available on Ollama?
There are 100+ LLMs available via Ollama. They differ in domain capabilities (coding, embedding, reasoning, chatting, philosophy, medical, maths, function calling), in context window size (8k, 16k, 24k, 128k, etc.), and in whether they need a GPU to run. Below is a sampling, grouped by primary use case.
Chatting/Assistant
- alfred: A robust conversational model designed to be used for both chat and instruct use cases.
- all-minilm: Embedding models on very large sentence level datasets.
- tinydolphin: An experimental 1.1B parameter model trained on the new Dolphin 2.8 dataset by Eric Hartford and based on TinyLlama.
- Aya 23, released by Cohere, is a new family of state-of-the-art, multilingual models that support 23 languages.
- bge-large: Embedding model from BAAI mapping texts to vectors.
- BGE-M3 is a new Embedding model from BAAI distinguished for its versatility in Multi-Functionality, Multi-Linguality, and Multi-Granularity.
- Command R is a Large Language Model optimized for conversational interaction and long context tasks.
- Command R+ is a powerful, scalable large language model purpose-built to excel at real-world enterprise use cases.
- DBRX is an open, general-purpose LLM created by Databricks.
- deepseek-llm: An advanced language model crafted with 2 trillion bilingual tokens.
- deepseek-v2: A strong, economical, and efficient Mixture-of-Experts language model.
- deepseek-v2.5: An upgraded version of DeepSeek-V2 that integrates the general and coding abilities of both DeepSeek-V2-Chat and DeepSeek-Coder-V2-Instruct.
- Dolphin 2.9 is a new model with 8B and 70B sizes by Eric Hartford based on Llama 3 that has a variety of instruction, conversational, and coding skills.
- dolphin-mixtral: Uncensored, 8x7b and 8x22b fine-tuned models based on the Mixtral mixture of experts models that excels at coding tasks. Created by Eric Hartford.
- everythinglm: Uncensored Llama2 based model with support for a 16K context window.
- falcon: A large language model built by the Technology Innovation Institute (TII) for use in summarization, text generation, and chat bots.
- Gemma is a family of lightweight, state-of-the-art open models built by Google DeepMind.
- glm4: A strong multi-lingual general language model with competitive performance to Llama 3.
- goliath: A language model created by combining two fine-tuned Llama 2 70B models into one.
- Google Gemma 2 is a high-performing and efficient model available in three sizes: 2B, 9B, and 27B.
- Hermes 3 is the latest version of the flagship Hermes series of LLMs by Nous Research
- Llama 2 is a collection of foundation language models ranging from 7B to 70B parameters.
- Llama 3.1 is a new state-of-the-art model from Meta available in 8B, 70B and 405B parameter sizes.
- llama2-chinese: Llama 2 based model fine tuned to improve Chinese dialogue ability.
- llama3-chatqa: A model from NVIDIA based on Llama 3 that excels at conversational question answering (QA) and retrieval-augmented generation (RAG).
- llama3-gradient: This model extends Llama 3 8B's context length from 8k to over 1m tokens.
- MegaDolphin-2.2-120b is a transformation of Dolphin-2.2-70b created by interleaving the model with itself.
- Meta Llama 3: The most capable openly available LLM to date
- Mistral OpenOrca is a 7 billion parameter model, fine-tuned on top of the Mistral 7B model using the OpenOrca dataset.
- mistral-nemo: A state-of-the-art 12B model with 128k context length, built by Mistral AI in collaboration with NVIDIA.
- mistral-small: Mistral Small is a lightweight model designed for cost-effective use in tasks like translation and summarization.
- MistralLite is a fine-tuned model based on Mistral with enhanced capabilities of processing long contexts.
- mixtral: A set of Mixture of Experts (MoE) models with open weights by Mistral AI, in 8x7b and 8x22b parameter sizes.
- neural-chat: A fine-tuned model based on Mistral with good coverage of domain and language.
- notus: A 7B chat model fine-tuned with high-quality data and based on Zephyr.
- notux: A top-performing mixture of experts model, fine-tuned with high-quality data.
- nous-hermes: General use models based on Llama and Llama 2 from Nous Research.
- nous-hermes2: The powerful family of models by Nous Research that excels at scientific discussion and coding tasks.
- nuextract: A 3.8B model fine-tuned on a private high-quality synthetic dataset for information extraction, based on Phi-3.
- OpenHermes 2.5 is a 7B model fine-tuned by Teknium on Mistral with fully open datasets.
- orca-mini: A general-purpose model ranging from 3 billion parameters to 70 billion, suitable for entry-level hardware.
- Phi-3 is a family of lightweight 3B (Mini) and 14B (Medium) state-of-the-art open models by Microsoft.
- phi3.5: A lightweight AI model with 3.8 billion parameters whose performance overtakes similarly sized and larger models.
- Qwen 1.5 is a series of large language models by Alibaba Cloud spanning from 0.5B to 110B parameters
- Qwen2 is a new series of large language models from Alibaba group
- reader-lm: A series of models that convert HTML content to Markdown content, which is useful for content conversion tasks.
- samantha-mistral: A companion assistant trained in philosophy, psychology, and personal relationships. Based on Mistral.
- smollm: A family of small models with 135M, 360M, and 1.7B parameters, trained on a new high-quality dataset.
- solar: A compact, yet powerful 10.7B large language model designed for single-turn conversation.
- Stable LM 2 is a state-of-the-art 1.6B and 12B parameter language model trained on multilingual data in English, Spanish, German, Italian, French, Portuguese, and Dutch.
- stable-beluga: Llama 2 based model fine tuned on an Orca-style dataset. Originally called Free Willy.
- stablelm-zephyr: A lightweight chat model allowing accurate, and responsive output without requiring high-end hardware.
- Starling is a large language model trained by reinforcement learning from AI feedback focused on improving chatbot helpfulness.
- The Nous Hermes 2 model from Nous Research, now trained over Mixtral.
- The TinyLlama project is an open endeavor to train a compact 1.1B Llama model on 3 trillion tokens.
- vicuna: General use chat model based on Llama and Llama 2 with 2K to 16K context sizes.
- Wizard Vicuna is a 13B parameter model based on Llama 2 trained by MelodysDreamj.
- Wizard Vicuna Uncensored is a 7B, 13B, and 30B parameter model based on Llama 2 uncensored by Eric Hartford.
- wizardlm-uncensored: Uncensored version of Wizard LM model
- xwinlm: Conversational model based on Llama 2 that performs competitively on various benchmarks.
- yarn-llama2: An extension of Llama 2 that supports a context of up to 128k tokens.
- yarn-mistral: An extension of Mistral to support context windows of 64K or 128K.
- Yi 1.5 is a high-performing, bilingual language model.
- Zephyr is a series of fine-tuned versions of the Mistral and Mixtral models that are trained to act as helpful assistants.
Multimodal & Vision
- BakLLaVA is a multimodal (vision) model consisting of the Mistral 7B base model augmented with the LLaVA architecture.
- minicpm-v: A series of multimodal LLMs (MLLMs) designed for vision-language understanding.
- LLaVA is a novel end-to-end trained large multimodal model that combines a vision encoder and Vicuna for general-purpose visual and language understanding.
- llava-llama3: A LLaVA (vision) model fine-tuned from Llama 3 Instruct with better scores in several benchmarks.
- llava-phi3: A new small LLaVA (vision) model fine-tuned from Phi 3 Mini.
- moondream2 is a small vision language model designed to run efficiently on edge devices.
Math
- llama-pro: An expansion of Llama 2 that specializes in integrating both general language understanding and domain-specific knowledge, particularly in programming and mathematics.
- Qwen2 Math is a series of specialized math language models built upon the Qwen2 LLMs, which significantly outperforms the mathematical capabilities of open-source models and even closed-source models (e.g., GPT4o).
- wizard-math: Model focused on math and logic problems
Coding
- codellama: A large language model that can use text prompts to generate and discuss code.
- codegeex4: A versatile model for AI software development scenarios, including code completion.
- codeup: Great code generation model based on Llama2.
- codebooga: A high-performing code instruct model created by merging two existing code models.
- Magicoder is a family of 7B parameter models trained on 75K synthetic instruction data using OSS-Instruct, a novel approach to enlightening LLMs with open-source code snippets.
- wizardcoder: State-of-the-art code generation model
- phind-codellama: Code generation model based on Code Llama.
- dolphincoder: A 7B and 15B uncensored variant of the Dolphin model family that excels at coding, based on StarCoder2.
- granite-code: A family of open foundation models by IBM for Code Intelligence
- deepseek-coder-v2: An open-source Mixture-of-Experts code language model that achieves performance comparable to GPT4-Turbo in code-specific tasks.
- SQLCoder is a code completion model fine-tuned on StarCoder for SQL generation tasks.
- StarCoder is a code generation model trained on 80+ programming languages.
- Yi-Coder is a series of open-source code language models that delivers state-of-the-art coding performance with fewer than 10 billion parameters.
- Codestral is Mistral AI’s first-ever code model designed for code generation tasks.
- Falcon2 is an 11B-parameter causal decoder-only model built by TII and trained over 5T tokens.
- Stable Code 3B is a coding model with instruct and code completion variants on par with models such as Code Llama 7B that are 2.5x larger.
- StarCoder2 is the next generation of transparently trained open code LLMs that comes in three sizes: 3B, 7B and 15B parameters.
- DeepSeek Coder is a capable coding model trained on two trillion code and natural language tokens.
- CodeQwen1.5 is a large language model pretrained on a large amount of code data.
- Mistral Large 2 is Mistral’s new flagship model that is significantly more capable in code generation, mathematics, and reasoning with 128k context window and support for dozens of languages.
- open-orca-platypus2: Merge of the Open Orca OpenChat model and the Garage-bAInd Platypus 2 model. Designed for chat and code generation.
- CodeGemma is a collection of powerful, lightweight models that can perform a variety of coding tasks like fill-in-the-middle code completion, code generation, natural language understanding, mathematical reasoning, and instruction following.
Embedding
- nomic-embed-text: A high-performing open embedding model with a large token context window.
- mxbai-embed-large: State-of-the-art large embedding model from mixedbread.ai.
- snowflake-arctic-embed: A suite of text embedding models by Snowflake, optimized for performance.
- paraphrase-multilingual: Sentence-transformers (embedding) model that can be used for tasks like clustering or semantic search.
Medical
- medllama2: Fine-tuned Llama 2 model to answer medical questions based on an open source medical dataset.
- meditron: Open-source medical large language model adapted from Llama 2 to the medical domain.
Function Calling
- Nexus Raven is a 13B instruction tuned model for function calling tasks.
- llama3-groq-tool-use: A series of models from Groq that represent a significant advancement in open-source AI capabilities for tool use/function calling.
- firefunction-v2: An open weights function calling model based on Llama 3, competitive with GPT-4o function calling capabilities.
Reasoning
- mathstral: MathΣtral is a 7B model designed for math reasoning and scientific discovery by Mistral AI.
- Phi-2: a 2.7B language model by Microsoft Research that demonstrates outstanding reasoning and language understanding capabilities.
- InternLM2.5 is a 7B parameter model tailored for practical scenarios with outstanding reasoning capability.
- wizardlm2: State of the art large language model from Microsoft AI with improved performance on complex chat, multilingual, reasoning and agent use cases.
- reflection: A high-performing model trained with a new technique called Reflection-tuning that teaches an LLM to detect mistakes in its reasoning and correct course.
- Orca 2 is built by Microsoft Research and is a fine-tuned version of Meta's Llama 2 models, designed to excel particularly in reasoning.
Question: Can we integrate these hundreds of models with different UIs, like ChatGPT?
Yes. In fact, you do NOT need to create any new UI. Hundreds of good UIs are available that already integrate with the LLMs on Ollama. Below are some of the popular UIs through which Ollama models can be accessed (a minimal client example follows the lists).
- Open WebUI
- Enchanted (macOS native)
- Hollama
- Lollms-Webui
- LibreChat
- Bionic GPT
- HTML UI
- Saddle
- Chatbot UI
- Chatbot UI v2
- Typescript UI
- Minimalistic React UI for Ollama Models
- Ollamac
- big-AGI
- Cheshire Cat assistant framework
- Amica
- chatd
- Ollama-SwiftUI
- Dify.AI
- MindMac
- NextJS Web Interface for Ollama
- Msty
- Chatbox
- WinForm Ollama Copilot
- NextChat with Get Started Doc
- Alpaca WebUI
- OllamaGUI
- OpenAOE
- Odin Runes
- LLM-X (Progressive Web App)
- AnythingLLM (Docker + MacOs/Windows/Linux native app)
- Ollama Basic Chat: Uses HyperDiv Reactive UI
- Ollama-chats RPG
- QA-Pilot (Chat with Code Repository)
- ChatOllama (Open Source Chatbot based on Ollama with Knowledge Bases)
- CRAG Ollama Chat (Simple Web Search with Corrective RAG)
- RAGFlow (Open-source Retrieval-Augmented Generation engine based on deep document understanding)
- StreamDeploy (LLM Application Scaffold)
- chat (chat web app for teams)
- Lobe Chat with Integrating Doc
- Ollama RAG Chatbot (Local Chat with multiple PDFs using Ollama and RAG)
- BrainSoup (Flexible native client with RAG & multi-agent automation)
- macai (macOS client for Ollama, ChatGPT, and other compatible API back-ends)
- Olpaka (User-friendly Flutter Web App for Ollama)
- OllamaSpring (Ollama Client for macOS)
- LLocal.in (Easy to use Electron Desktop Client for Ollama)
- AiLama (A Discord user app that allows you to interact with Ollama anywhere in Discord)
- Ollama with Google Mesop (Mesop Chat Client implementation with Ollama)
- Painting Droid (Painting app with AI integrations)
- Kerlig AI (AI writing assistant for macOS)
- AI Studio
- Sidellama (browser-based LLM client)
- LLMStack (No-code multi-agent framework to build LLM agents and workflows)
- BoltAI for Mac (AI Chat Client for Mac)
- Harbor (Containerized LLM Toolkit with Ollama as default backend)
- Go-CREW (Powerful Offline RAG in Golang)
- PartCAD (CAD model generation with OpenSCAD and CadQuery)
- Ollama4j Web UI - Java-based Web UI for Ollama built with Vaadin, Spring Boot and Ollama4j
- PyOllaMx - macOS application capable of chatting with both Ollama and Apple MLX models.
- Claude Dev - VSCode extension for multi-file/whole-repo coding
- Cherry Studio (Desktop client with Ollama support)
- ConfiChat (Lightweight, standalone, multi-platform, and privacy focused LLM chat interface with optional encryption)
- Archyve (RAG-enabling document library)
- crewAI with Mesop (Mesop Web Interface to run crewAI with Ollama)
For mobile UI, you can explore these resources.
- Enchanted
- Maid
- ConfiChat (Lightweight, standalone, multi-platform, and privacy focused LLM chat interface with optional encryption)
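Most of these front-ends simply point at a running Ollama server. Ollama also exposes an OpenAI-compatible endpoint, which is how many ChatGPT-style UIs connect to it. A minimal sketch using the official openai Python client, assuming llama3.1 has been pulled (the api_key is required by the client but ignored by Ollama):
```python
from openai import OpenAI

# Point the OpenAI client at the local Ollama server's
# OpenAI-compatible endpoint instead of api.openai.com.
client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

chat = client.chat.completions.create(
    model="llama3.1",  # any model you have pulled locally
    messages=[{"role": "user", "content": "Hello!"}],
)
print(chat.choices[0].message.content)
```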
Question: If I want to use all these Ollama models via a Jupyter Notebook, what should I do?
There are dozens of libraries that integrate Ollama models. You can pip install the Python ones and use them directly in a notebook (a short example follows the list). Some of the popular libraries are:
- LangChain and LangChain.js with example
- Firebase Genkit
- crewAI
- LangChainGo with example
- LangChain4j with example
- LangChainRust with example
- LlamaIndex
- LiteLLM
- OllamaFarm for Go
- OllamaSharp for .NET
- Ollama for Ruby
- Ollama-rs for Rust
- Ollama-hpp for C++
- Ollama4j for Java
- ModelFusion Typescript Library
- OllamaKit for Swift
- Ollama for Dart
- Ollama for Laravel
- LangChainDart
- Semantic Kernel - Python
- Haystack
- Elixir LangChain
- Ollama for R - rollama
- Ollama for R - ollama-r
- Ollama-ex for Elixir
- Ollama Connector for SAP ABAP
- Testcontainers
- Portkey
- PromptingTools.jl with an example
- LlamaScript
- Gollm
- Ollamaclient for Golang
- High-level function abstraction in Go
- Ollama PHP
- Agents-Flex for Java with example
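For example, inside a Jupyter cell you can install langchain-ollama and talk to a local model through LangChain. A minimal sketch, assuming llama3.1 has been pulled and the Ollama service is running:
```python
# In a notebook cell, first run: %pip install langchain-ollama
from langchain_ollama import ChatOllama

# ChatOllama talks to the local Ollama server (port 11434 by default).
llm = ChatOllama(model="llama3.1", temperature=0)

reply = llm.invoke("Explain retrieval-augmented generation in two sentences.")
print(reply.content)
```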
Question: Does Ollama have plugins like GitHub Copilot? Can I use them from VS Code?
Yes, there are many such plugins, for coding and other purposes alike. Even for coding alone there are dozens of plugins with different capabilities, and you need not pay a monthly fee for them the way you do for Microsoft's Copilot. Some of these plugins are:
- Llama Coder (Copilot alternative using Ollama)
- Ollama Copilot (Proxy that allows you to use ollama as a copilot like Github copilot)
- Copilot for Obsidian plugin
- Raycast extension
- Discollama (Discord bot inside the Ollama discord channel)
- Continue
- Obsidian Ollama plugin
- Logseq Ollama plugin
- NotesOllama (Apple Notes Ollama plugin)
- Dagger Chatbot
- Discord AI Bot
- Ollama Telegram Bot
- Hass Ollama Conversation
- Rivet plugin
- Obsidian BMO Chatbot plugin
- Cliobot (Telegram bot with Ollama support)
- Obsidian Local GPT plugin
- Open Interpreter
- twinny (Copilot and Copilot chat alternative using Ollama)
- Wingman-AI (Copilot code and chat alternative using Ollama and Hugging Face)
- Page Assist (Chrome Extension)
- Plasmoid Ollama Control (KDE Plasma extension that allows you to quickly manage/control Ollama model)
- AI Telegram Bot (Telegram bot using Ollama in backend)
- AI ST Completion (Sublime Text 4 AI assistant plugin with Ollama support)
- Discord-Ollama Chat Bot (Generalized TypeScript Discord Bot w/ Tuning Documentation)
- Discord AI chat/moderation bot (Chat/moderation bot written in Python; uses Ollama to create personalities)
- Headless Ollama (Scripts to automatically install the Ollama client & models on any OS, for apps that depend on an Ollama server)
- vnc-lm (A containerized Discord bot with support for attachments and web links)
- LSP-AI (Open-source language server for AI-powered functionality)
- QodeAssist (AI-powered coding assistant plugin for Qt Creator)
- Obsidian Quiz Generator plugin
Question: What kind of software are LM Studio and Ollama?
Both Ollama and LM Studio are software tools that allow users to interact with, run, and experiment with multiple large language models; they are model management tools or LLM execution environments rather than models themselves. Their role is to make these models easy to use by providing a platform that supports multiple models and offers local deployment and experimentation without needing to deal with the complex setup each model requires.
- They are a platform or interface for LLMs: a single tool through which you can load and talk to many different models.
- They are a Model Hub: these tools serve as hubs where you can load, execute, and work with a variety of pre-trained LLMs. Instead of being limited to one specific model, you can work with models like Llama, Mistral, Gemma, and others.
- They are Model Runners: they enable the running and execution of multiple models.
- They are LLM Execution/Management Tools: they manage various models and allow you to deploy them.
You can think of them as infrastructure that abstracts away the complexities of working with different LLMs.
Question: What is LM Studio, and how is it different from Ollama?
Ollama is a CLI-first tool with a background service and REST API, designed for ease of use and for running pre-trained open models locally; it is a good fit for developers who prioritize simplicity, privacy, and API-based integration. LM Studio is a desktop GUI application for discovering, downloading, and chatting with local models; it is a good fit for users who prefer a graphical interface for browsing models, tweaking inference parameters, and running a local server. Note that both tools run pre-trained (often quantized) models; neither is a training or fine-tuning platform.
Here's a detailed comparison of Ollama and LM Studio in terms of their capabilities:
Feature/Capability | Ollama | LM Studio |
---|---|---|
Primary Purpose | Run and manage multiple LLMs locally via CLI and API | Discover, download, and chat with local LLMs via a desktop GUI |
Supported Models | Open-weight models such as Llama, Mistral, Gemma, and Phi from its curated library | Open-weight GGUF models from Hugging Face and other sources |
Local Model Execution | Supports running models locally without cloud dependencies | Supports running models locally without cloud dependencies |
Model Fine-tuning/Training | No, runs pre-trained models | No, runs pre-trained models |
Experimentation Tools | Minimal; focused on simple deployment and scripting | GUI tools for experimenting with prompts, sampling parameters, and presets |
Ease of Use | Simple CLI, easy to script and automate | Point-and-click GUI, friendly for non-technical users |
Hardware Requirements | Runs on local CPUs or GPUs; larger models need more RAM/VRAM | Runs on local CPUs or GPUs; larger models need more RAM/VRAM |
Privacy | Strong privacy due to local model execution | Strong privacy due to local model execution |
API Integration | Yes, a native REST API plus an OpenAI-compatible endpoint | Yes, a built-in local server exposing an OpenAI-compatible API |
Cloud Integration | Primarily local execution; not designed for cloud-based workflows | Primarily local execution; not designed for cloud-based workflows |
Model Deployment | Deployed locally and integrated via API into applications | Mainly interactive desktop use, plus the local API server |
Pre-Trained Models | Easy access via its model library (ollama pull) | Built-in search and download of models from Hugging Face |
Target Audience | Developers and users who want scriptable local LLM access | Users who want a polished GUI for local LLMs |
Licensing | Open source (MIT) | Closed source; free for personal use |
Key Capabilities of Ollama:
- Run LLMs Locally: Focuses on running pre-trained open models such as Llama 3.1, Phi-3, Mistral, and Gemma on your local machine without requiring cloud dependencies.
- Simple Setup: Aimed at developers and non-technical users who want easy access to LLMs.
- Privacy & Security: Since models run locally, no data is sent to external servers, enhancing privacy.
- API Integration: Provides APIs to integrate models into applications, making it useful for local deployment.
- Resource Optimization: Automatically manages local system resources, including CPU and GPU, to run models efficiently.
Key Capabilities of LM Studio:
- Model Discovery & Download: Search for and download GGUF models (for example, from Hugging Face) inside the app.
- Built-in Chat UI: Chat with local models and manage conversations graphically.
- Inference Controls: Tweak sampling parameters such as temperature, top-p, and context length per model or preset.
- Local API Server: Serve any loaded model through an OpenAI-compatible endpoint for use by other applications.
- Cross-Platform Desktop App: Runs on Windows, macOS, and Linux.
Question: What are the different formats for saving models, specifically LLMs?
Large language models (LLMs) can be stored in various formats, each suited for different purposes and platforms. These formats cater to different needs, from interoperability between frameworks (ONNX) to specific hardware optimizations (OpenVINO, TensorFlow Lite). The choice of format depends on the specific requirements of the deployment environment and the tools being used.
Here are some common model formats used for LLMs:
1. PyTorch (`.pt`, `.pth`)
- Description: Files with `.pt` or `.pth` extensions are commonly used to store PyTorch models. These files contain the model's weights and architecture.
- Usage: Typically used with PyTorch frameworks for loading and running models.
- Example: Models saved using `torch.save(model.state_dict(), 'model.pth')`.
2. TensorFlow (`.pb`, `.h5`, `.tf`)
- Description: TensorFlow models can be saved in multiple formats: `.pb` (Protocol Buffers) for saving the complete model, including weights and architecture; `.h5` (HDF5) for saving models in Keras (the high-level API for TensorFlow); and `.tf` for saving TensorFlow models in the SavedModel format.
- Usage: Used with TensorFlow for model deployment and inference.
- Example: Models saved using `model.save('model.h5')` or `tf.saved_model.save(model, 'saved_model')`.
3. ONNX (`.onnx`)
- Description: Open Neural Network Exchange (ONNX) is a format for representing deep learning models. It allows interoperability between different deep learning frameworks.
- Usage: Enables models trained in one framework (like PyTorch) to be used in another (like TensorFlow).
- Example: Models converted to ONNX using `torch.onnx.export(model, inputs, 'model.onnx')`.
4. OpenVINO (`.bin`, `.xml`)
- Description: OpenVINO uses `.bin` and `.xml` files to represent optimized models for Intel hardware.
- Usage: Provides acceleration for inference on Intel devices.
- Example: Models optimized with OpenVINO are stored in `.xml` (model structure) and `.bin` (weights) files.
5. GGUF (`.gguf`)
- Description: GGUF (GPT-Generated Unified Format) is a binary format introduced by the llama.cpp project as the successor to GGML. It stores a model's (often quantized) weights together with its metadata and tokenizer in a single file, providing a standardized way to store and share large language models.
- Usage: The de facto format for running LLaMA-family and many other open models locally; it is what tools like llama.cpp, Ollama, and LM Studio consume.
- Example: Models saved in GGUF format have the `.gguf` file extension.
6. SavedModel (`SavedModel`)
- Description: TensorFlow's SavedModel format includes a directory with serialized model weights, graph definitions, and metadata.
- Usage: TensorFlow's recommended format for serving models in production.
- Example: A SavedModel directory includes files like `saved_model.pb` and a `variables` directory.
7. Core ML (`.mlmodel`)
- Description: Apple's Core ML format is used for deploying models on iOS, macOS, watchOS, and tvOS.
- Usage: Used for integrating machine learning models into Apple applications.
- Example: Models converted to Core ML using tools like `coremltools`.
8. TensorFlow Lite (`.tflite`)
- Description: A format for deploying TensorFlow models on mobile and edge devices. It provides a smaller, more efficient representation of the model.
- Usage: Optimized for mobile and embedded devices.
- Example: Models converted to TensorFlow Lite format using `tf.lite.TFLiteConverter`.
9. Hugging Face (`.bin`, `config.json`, `tokenizer.json`)
- Description: Hugging Face models typically use `.bin` files for weights and JSON files for configuration and tokenizers. This format is often associated with the Transformers library.
- Usage: Used with Hugging Face's Transformers library for loading and fine-tuning models.
- Example: Models from Hugging Face's model hub include `.bin` files for weights and configuration files.
10. Hugging Face (`.safetensors`)
- Description: SafeTensors, developed by Hugging Face, is a format for safely and efficiently storing tensor data, particularly for large language models. Unlike pickle-based `.bin` files, loading a `.safetensors` file cannot execute arbitrary code, so it provides secure and consistent handling of model weights.
- Usage: Designed to improve safety and integrity in model storage by addressing issues related to file corruption and ensuring the integrity of the model data. It is increasingly used in the machine learning and AI communities for its security benefits (see the sketch below).
- Key Features:
- Safety: Ensures data integrity and helps prevent corruption.
- Efficiency: Optimized for storage and retrieval of large model weights.
- Compatibility: Designed to be used with various frameworks and tools that support tensor-based models.
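As an illustration, here is a minimal sketch of writing and reading weights with the safetensors library; the tensor names and shapes are arbitrary placeholders:
```python
import torch
from safetensors.torch import save_file, load_file

# Save a dictionary of tensors to a .safetensors file.
# (The layer names here are arbitrary placeholders.)
weights = {
    "embedding.weight": torch.randn(1000, 64),
    "lm_head.weight": torch.randn(64, 1000),
}
save_file(weights, "model.safetensors")

# Load the file back; only tensor data is deserialized,
# so no arbitrary code can run during loading.
restored = load_file("model.safetensors")
print(restored["embedding.weight"].shape)  # torch.Size([1000, 64])
```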
Summary of Model Formats Including SafeTensors:
- PyTorch (`.pt`, `.pth`)
- TensorFlow (`.pb`, `.h5`, `.tf`)
- ONNX (`.onnx`)
- Hugging Face (`.bin`, `config.json`, `tokenizer.json`)
- GGUF (`.gguf`)
- SavedModel (`SavedModel`)
- Core ML (`.mlmodel`)
- TensorFlow Lite (`.tflite`)
- OpenVINO (`.bin`, `.xml`)
- SafeTensors (`.safetensors`)
Question: What is the GGUF model extension?
GGUF (GPT-Generated Unified Format) is a binary file format created by the llama.cpp project, as the successor to GGML, for storing and sharing large language models. It packs a model's weights, metadata, and tokenizer into a single file and supports quantized weights, which is how tools like Ollama and LM Studio load models on modest hardware. It is particularly associated with Meta's LLaMA (Large Language Model Meta AI) series, whose weights and configurations it represents in a way that can be easily loaded and utilized across different environments, though many other open models are distributed as GGUF as well.
The GGUF format aims at:
- Standardization: GGUF standardizes how model data is stored and exchanged, making it easier to work with LLaMA models and other models that adopt this format.
- Efficiency: The format is designed to efficiently handle the large size of modern language models, ensuring that models can be loaded and processed quickly.
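For instance, a GGUF file can be loaded directly with the llama-cpp-python bindings; a minimal sketch, where the model path is a placeholder for any GGUF file you have downloaded:
```python
from llama_cpp import Llama

# Load a quantized GGUF model from disk (the path is a placeholder).
llm = Llama(model_path="./phi-3-mini-4k-instruct-q4.gguf", n_ctx=4096)

# Run a simple completion against the locally loaded model.
out = llm("Q: What is GGUF? A:", max_tokens=64)
print(out["choices"][0]["text"])
```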
Question: If I have fine-tuned my models on clouds like AWS SageMaker, Vertex AI, or Azure and kept them there, can I use them inside Ollama and LM Studio?
Yes, in two ways.
Method 1: API integration
- Obtain the endpoint URL and API key from your cloud ML platform (Vertex AI / SageMaker / Azure ML).
- Prepare your environment for making HTTP requests.
- Send requests to the API endpoint using tools like Python's requests library (a sketch follows these steps).
- Integrate the API calls into your tooling alongside LM Studio or Ollama.
- Test and validate the integration to ensure it functions correctly.
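A minimal sketch of Method 1, assuming a hypothetical HTTPS inference endpoint and API key; the URL, key, and JSON shape all depend on your cloud provider and deployment:
```python
import requests

# Hypothetical endpoint and key: substitute the values from your
# SageMaker / Vertex AI / Azure ML deployment.
ENDPOINT = "https://my-endpoint.example.com/v1/predict"
API_KEY = "YOUR_API_KEY"

resp = requests.post(
    ENDPOINT,
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={"prompt": "Summarize Ollama in one line.", "max_tokens": 64},
    timeout=60,
)
resp.raise_for_status()
print(resp.json())
```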
Method 2: Model conversion and export
- Export models: Export the fine-tuned model artifacts from the cloud service (typically Hugging Face-style safetensors checkpoints) and download the model files locally.
- Import into Ollama: Ollama runs GGUF models, so convert the exported weights into GGUF (for example, with llama.cpp's conversion script) and register them with a Modelfile; a sketch follows below. The same GGUF file can also be loaded into LM Studio by placing it in its models directory.
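A sketch of Method 2 under these assumptions: the fine-tuned weights were exported as a Hugging Face-style checkpoint, llama.cpp is cloned locally (its `convert_hf_to_gguf.py` script does the conversion), and the model architecture is one llama.cpp supports; all paths and names are placeholders:
```python
import subprocess

# 1) Convert the exported Hugging Face checkpoint to GGUF using the
#    conversion script from a local clone of the llama.cpp repository.
subprocess.run(
    ["python", "llama.cpp/convert_hf_to_gguf.py", "./my-finetuned-model",
     "--outfile", "my-finetuned.gguf"],
    check=True,
)

# 2) Write a Modelfile pointing Ollama at the local GGUF file.
with open("Modelfile", "w") as f:
    f.write("FROM ./my-finetuned.gguf\n")

# 3) Register the model with Ollama; it can then be run with `ollama run`.
subprocess.run(["ollama", "create", "my-finetuned", "-f", "Modelfile"], check=True)
```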
References
- https://ollama.com
- https://github.com/ollama/ollama