
Exploring Ollama & LM Studio

Is this article for me?

If you are looking for answers to the following questions, then this article is for you:

  • Question: What is Ollama? Is it like Docker?
  • Question: How is Ollama different from Docker?
  • Question: How do I install Ollama on my machine?
  • Question: How do I create a customized LLM model (like a Docker image)?
  • Question: What LLMs are available on Ollama?
  • Question: Can we use these hundreds of models with a ChatGPT-like UI?
  • Question: How do I use Ollama models from a Jupyter Notebook?
  • Question: Does Ollama have plugins like GitHub Copilot? Can I use them from Visual Studio Code?
  • Question: What kind of software are Ollama and LM Studio?
  • Question: What is LM Studio, and how is it different from Ollama?
  • Question: What are the different formats for saving models, specifically LLMs?
  • Question: What is the GGUF model extension?
  • Question: If I have fine-tuned my models on cloud platforms like AWS SageMaker, Vertex AI, or Azure and kept them there, can I use them inside Ollama and LM Studio?

Question: What is Ollama? Is it like Docker?

Ollama is a platform designed to make running and interacting with large language models (LLMs) easier. It abstracts away the complexities of managing LLM models, GPU resources, and related configurations by offering a simple CLI interface. With Ollama, you can run, manage, and deploy LLMs locally or in various cloud environments without having to worry about the intricate details of setting up environments, downloading models, or configuring them.

Key Features of Ollama:

  • Model Management: Ollama can download and store LLMs in a local cache for you to run, typically in a format optimized for the hardware available (like your local GPU).
  • GPU/CPU Utilization: It detects hardware resources, such as your NVIDIA GPU, and automatically uses them for model acceleration without additional setup.
  • Service Setup: When you install Ollama, it sets up a service running in the background that serves models on an API, so you can interact with them programmatically.

Question: How is Ollama different from Docker?

While Ollama and Docker both deal with isolated environments, they serve different purposes:

  • Ollama focuses specifically on running machine learning models, especially large language models, and optimizes resources to make them easily accessible and deployable.
  • Docker is a general-purpose containerization tool that allows you to package applications with their dependencies in isolated environments. It’s used for deploying a wide variety of applications, not just models.

So, while Docker might also be used to set up machine learning environments or serve models, Ollama is specialized and optimized for the LLM use case.

In Summary: Ollama = Model management platform for LLMs, with easy CLI and automatic resource optimization. Docker = General containerization tool for deploying all types of applications in isolated environments.

Question: How do I install Ollama on my machine?

Refer to https://ollama.com/download/linux, https://github.com/ollama/ollama, and https://github.com/ollama/ollama-python.

  • To install Ollama on Linux/WSL:
    curl -fsSL https://ollama.com/install.sh | sh
  • To run a model (phi3 here):
    ollama run phi3
  • To verify the service is up, open http://127.0.0.1:11434/ in a browser; it should respond with "Ollama is running".
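
Once the service is running, you can also talk to it over HTTP. Below is a minimal sketch using Python's requests library against Ollama's /api/generate endpoint; it assumes the phi3 model has already been pulled as above.

import requests

# Ask the locally running Ollama service for a completion.
# "stream": False returns one JSON object instead of a token stream.
resp = requests.post(
    "http://127.0.0.1:11434/api/generate",
    json={"model": "phi3", "prompt": "Explain LLMs in one sentence.", "stream": False},
)
print(resp.json()["response"])  # the generated text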

Question: How do I create a customized LLM model (like a Docker image)?

If you know how Docker images, containers, Docker Hub, and the Docker CLI work, you will feel at home with Ollama commands; a rough mapping is shown below.
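
As a rough analogy (the two tools are not equivalent, as discussed above), common Docker commands map to Ollama commands like this:

docker pull <image>    ->  ollama pull <model>
docker images          ->  ollama list
docker run <image>     ->  ollama run <model>
docker rmi <image>     ->  ollama rm <model>
docker push <image>    ->  ollama push <model>
Dockerfile             ->  Modelfile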

Step 1: Create a Modelfile

FROM llama3.1

# set the temperature to 1 [higher is more creative, lower is more coherent]
PARAMETER temperature 1

# set the system message
SYSTEM """
You are Travel Advisor from Air India Airlines. Answer as AI Advisor, the assistant, only.
"""

Step 2: Create and run the model

ollama create aiadvisor -f ./Modelfile
ollama run aiadvisor
>>> hi
Hello! It's your friend AI Advisor.
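
You can also call this custom model programmatically. Here is a minimal sketch using the official ollama Python library (pip install ollama); it assumes the aiadvisor model was created as above and the Ollama service is running.

import ollama

# Chat with the custom model created from the Modelfile above.
reply = ollama.chat(
    model="aiadvisor",
    messages=[{"role": "user", "content": "Suggest a weekend getaway from Delhi."}],
)
print(reply["message"]["content"])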

Question: What LLMs are available on Ollama?

There are 100+ LLMs available via Ollama. They differ in domain capabilities (coding, embedding, reasoning, chatting, philosophy, medical, maths, function calling), in context window size (8k, 16k, 24k, 128k, etc.), and in whether they require a GPU to run.

Chatting/Assistant

  1. alfred: A robust conversational model designed to be used for both chat and instruct use cases.
  2. all-minilm: Embedding models on very large sentence level datasets.
  3. An experimental 1.1B parameter model trained on the new Dolphin 2.8 dataset by Eric Hartford and based on TinyLlama.
  4. Aya 23, released by Cohere, is a new family of state-of-the-art, multilingual models that support 23 languages.
  5. bge-large: Embedding model from BAAI mapping texts to vectors.
  6. BGE-M3 is a new Embedding model from BAAI distinguished for its versatility in Multi-Functionality, Multi-Linguality, and Multi-Granularity.
  7. Command R is a Large Language Model optimized for conversational interaction and long context tasks.
  8. Command R+ is a powerful, scalable large language model purpose-built to excel at real-world enterprise use cases.
  9. DBRX is an open, general-purpose LLM created by Databricks.
  10. deepseek-llm: An advanced language model crafted with 2 trillion bilingual tokens.
  11. deepseek-v2: A strong, economical, and efficient Mixture-of-Experts language model.
  12. deepseek-v2.5: An upgraded version of DeepSeek-V2 that integrates the general and coding abilities of both DeepSeek-V2-Chat and DeepSeek-Coder-V2-Instruct.
  13. Dolphin 2.9 is a new model with 8B and 70B sizes by Eric Hartford based on Llama 3 that has a variety of instruction, conversational, and coding skills.
  14. dolphin-mixtral: Uncensored, 8x7b and 8x22b fine-tuned models based on the Mixtral mixture of experts models that excels at coding tasks. Created by Eric Hartford.
  15. everythinglm: Uncensored Llama2 based model with support for a 16K context window.
  16. falcon: A large language model built by the Technology Innovation Institute (TII) for use in summarization, text generation, and chat bots.
  17. Gemma is a family of lightweight, state-of-the-art open models built by Google DeepMind.
  18. glm4: A strong multi-lingual general language model with competitive performance to Llama 3.
  19. goliath: A language model created by combining two fine-tuned Llama 2 70B models into one.
  20. Google Gemma 2 is a high-performing and efficient model available in three sizes: 2B, 9B, and 27B.
  21. Hermes 3 is the latest version of the flagship Hermes series of LLMs by Nous Research
  22. Llama 2 is a collection of foundation language models ranging from 7B to 70B parameters.
  23. Llama 3.1 is a new state-of-the-art model from Meta available in 8B, 70B and 405B parameter sizes.
  24. llama2-chinese: Llama 2 based model fine tuned to improve Chinese dialogue ability.
  25. llama3-chatqa: A model from NVIDIA based on Llama 3 that excels at conversational question answering (QA) and retrieval-augmented generation (RAG).
  26. llama3-gradient: This model extends Llama-3 8B’s context length from 8k to over 1m tokens.
  27. MegaDolphin-2.2-120b is a transformation of Dolphin-2.2-70b created by interleaving the model with itself.
  28. Meta Llama 3: The most capable openly available LLM to date
  29. Mistral OpenOrca is a 7 billion parameter model, fine-tuned on top of the Mistral 7B model using the OpenOrca dataset.
  30. mistral-nemo: A state-of-the-art 12B model with 128k context length, built by Mistral AI in collaboration with NVIDIA.
  31. mistral-small: Mistral Small is a lightweight model designed for cost-effective use in tasks like translation and summarization.
  32. MistralLite is a fine-tuned model based on Mistral with enhanced capabilities of processing long contexts.
  33. mixtral: A set of Mixture of Experts (MoE) models with open weights by Mistral AI, in 8x7b and 8x22b parameter sizes.
  34. neural-chat: A fine-tuned model based on Mistral with good coverage of domain and language.
  35. notus: A 7B chat model fine-tuned with high-quality data and based on Zephyr.
  36. notux: A top-performing mixture of experts model, fine-tuned with high-quality data.
  37. nous-hermes: General use models based on Llama and Llama 2 from Nous Research.
  38. nous-hermes2: The powerful family of models by Nous Research that excels at scientific discussion and coding tasks.
  39. nuextract: A 3.8B model fine-tuned on a private high-quality synthetic dataset for information extraction, based on Phi-3.
  40. OpenHermes 2.5 is a 7B model fine-tuned by Teknium on Mistral with fully open datasets.
  41. orca-mini: A general-purpose model ranging from 3 billion parameters to 70 billion, suitable for entry-level hardware.
  42. Phi-3 is a family of lightweight 3B (Mini) and 14B (Medium) state-of-the-art open models by Microsoft.
  43. phi3.5: A lightweight AI model with 3.8 billion parameters whose performance overtakes similarly sized and larger models.
  44. Qwen 1.5 is a series of large language models by Alibaba Cloud spanning from 0.5B to 110B parameters
  45. Qwen2 is a new series of large language models from Alibaba group
  46. reader-lm: A series of models that convert HTML content to Markdown content, which is useful for content conversion tasks.
  47. samantha-mistral: A companion assistant trained in philosophy, psychology, and personal relationships. Based on Mistral.
  48. smollm: A family of small models with 135M, 360M, and 1.7B parameters, trained on a new high-quality dataset.
  49. solar: A compact, yet powerful 10.7B large language model designed for single-turn conversation.
  50. Stable LM 2 is a state-of-the-art 1.6B and 12B parameter language model trained on multilingual data in English, Spanish, German, Italian, French, Portuguese, and Dutch.
  51. stable-beluga: Llama 2 based model fine tuned on an Orca-style dataset. Originally called Free Willy.
  52. stablelm-zephyr: A lightweight chat model allowing accurate and responsive output without requiring high-end hardware.
  53. Starling is a large language model trained by reinforcement learning from AI feedback focused on improving chatbot helpfulness.
  54. The Nous Hermes 2 model from Nous Research, now trained over Mixtral.
  55. The TinyLlama project is an open endeavor to train a compact 1.1B Llama model on 3 trillion tokens.
  56. vicuna: General use chat model based on Llama and Llama 2 with 2K to 16K context sizes.
  57. Wizard Vicuna is a 13B parameter model based on Llama 2 trained by MelodysDreamj.
  58. Wizard Vicuna Uncensored is a 7B, 13B, and 30B parameter model based on Llama 2 uncensored by Eric Hartford.
  59. wizardlm-uncensored: Uncensored version of Wizard LM model
  60. xwinlm: Conversational model based on Llama 2 that performs competitively on various benchmarks.
  61. yarn-llama2: An extension of Llama 2 that supports a context of up to 128k tokens.
  62. yarn-mistral: An extension of Mistral to support context windows of 64K or 128K.
  63. Yi 1.5 is a high-performing, bilingual language model.
  64. Zephyr is a series of fine-tuned versions of the Mistral and Mixtral models that are trained to act as helpful assistants.

Multimodal & Vision

  1. BakLLaVA is a multimodal (vision) model consisting of the Mistral 7B base model augmented with the LLaVA architecture.
  2. minicpm-v: A series of multimodal LLMs (MLLMs) designed for vision-language understanding.
  3. LLaVA is a novel end-to-end trained large multimodal model that combines a vision encoder and Vicuna for general-purpose visual and language understanding.
  4. llava-llama3: A LLaVA (vision) model fine-tuned from Llama 3 Instruct with better scores in several benchmarks.
  5. llava-phi3: A new small LLaVA (vision) model fine-tuned from Phi 3 Mini.
  6. moondream2 is a small vision language model designed to run efficiently on edge devices.

Math

  1. llama-pro: An expansion of Llama 2 that specializes in integrating both general language understanding and domain-specific knowledge, particularly in programming and mathematics.
  2. Qwen2 Math is a series of specialized math language models built upon the Qwen2 LLMs, which significantly outperform open-source models, and even closed-source models (e.g., GPT-4o), on mathematical tasks.
  3. wizard-math: Model focused on math and logic problems

Coding

  1. codellama: A large language model that can use text prompts to generate and discuss code.
  2. codegeex4: A versatile model for AI software development scenarios, including code completion.
  3. codeup: Great code generation model based on Llama2.
  4. codebooga: A high-performing code instruct model created by merging two existing code models.
  5. Magicoder is a family of 7B parameter models trained on 75K synthetic instruction data using OSS-Instruct, a novel approach to enlightening LLMs with open-source code snippets.
  6. wizardcoder: State-of-the-art code generation model
  7. phind-codellama: Code generation model based on Code Llama.
  8. dolphincoder: A 7B and 15B uncensored variant of the Dolphin model family that excels at coding, based on StarCoder2.
  9. granite-code: A family of open foundation models by IBM for Code Intelligence
  10. deepseek-coder-v2: An open-source Mixture-of-Experts code language model that achieves performance comparable to GPT4-Turbo in code-specific tasks.
  11. SQLCoder is a code completion model fine-tuned on StarCoder for SQL generation tasks.
  12. StarCoder is a code generation model trained on 80+ programming languages.
  13. Yi-Coder is a series of open-source code language models that delivers state-of-the-art coding performance with fewer than 10 billion parameters.
  14. Codestral is Mistral AI’s first-ever code model designed for code generation tasks.
  15. Falcon2 is an 11B parameters causal decoder-only model built by TII and trained over 5T tokens.
  16. Stable Code 3B is a coding model with instruct and code completion variants on par with models such as Code Llama 7B that are 2.5x larger.
  17. StarCoder2 is the next generation of transparently trained open code LLMs that comes in three sizes: 3B, 7B and 15B parameters.
  18. DeepSeek Coder is a capable coding model trained on two trillion code and natural language tokens.
  19. CodeQwen1.5 is a large language model pretrained on a large amount of code data.
  20. Mistral Large 2 is Mistral’s new flagship model that is significantly more capable in code generation, mathematics, and reasoning with 128k context window and support for dozens of languages.
  21. open-orca-platypus2: Merge of the Open Orca OpenChat model and the Garage-bAInd Platypus 2 model. Designed for chat and code generation.
  22. CodeGemma is a collection of powerful, lightweight models that can perform a variety of coding tasks like fill-in-the-middle code completion, code generation, natural language understanding, mathematical reasoning, and instruction following.

Embedding

  1. nomic-embed-text: A high-performing open embedding model with a large token context window.
  2. mxbai-embed-large: State-of-the-art large embedding model from mixedbread.ai.
  3. snowflake-arctic-embed: A suite of text embedding models by Snowflake, optimized for performance.
  4. paraphrase-multilingual: Sentence-transformers (embedding) model that can be used for tasks like clustering or semantic search.

Medical

  1. medllama2: Fine-tuned Llama 2 model to answer medical questions based on an open source medical dataset.
  2. meditron: Open-source medical large language model adapted from Llama 2 to the medical domain.

Function Calling

  1. Nexus Raven is a 13B instruction tuned model for function calling tasks.
  2. llama3-groq-tool-use: A series of models from Groq that represent a significant advancement in open-source AI capabilities for tool use/function calling.
  3. firefunction-v2: An open weights function calling model based on Llama 3, competitive with GPT-4o function calling capabilities.

Reasoning

  1. mathstral: MathΣtral: a 7B model designed for math reasoning and scientific discovery by Mistral AI.
  2. Phi-2: a 2.7B language model by Microsoft Research that demonstrates outstanding reasoning and language understanding capabilities.
  3. InternLM2.5 is a 7B parameter model tailored for practical scenarios with outstanding reasoning capability.
  4. wizardlm2: State of the art large language model from Microsoft AI with improved performance on complex chat, multilingual, reasoning and agent use cases.
  5. reflection: A high-performing model trained with a new technique called Reflection-tuning that teaches an LLM to detect mistakes in its reasoning and correct course.
  6. Orca 2 is built by Microsoft Research and is a fine-tuned version of Meta’s Llama 2 models. It is designed to excel particularly at reasoning.

Question: Can we use these hundreds of models with a ChatGPT-like UI?

Yes; in fact, you do not need to create any new UI. Many good UIs are already integrated with the LLMs available on Ollama; the "Community Integrations" section of the Ollama GitHub README maintains a long list of web and desktop interfaces (Open WebUI is a popular example).

For mobile UIs, you can explore these:

  • Enchanted
  • Maid
  • ConfiChat (Lightweight, standalone, multi-platform, and privacy focused LLM chat interface with optional encryption)

Question: How do I use Ollama models from a Jupyter Notebook?

There are dozens of libraries that integrate with Ollama; you can pip install them and use them in your Python code. Popular options include the official ollama package (see the ollama-python link above), LangChain, and LlamaIndex; a fuller list is maintained under "Community Integrations" in the Ollama GitHub README.
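
For example, the following sketch runs in a notebook cell using the official ollama package; phi3 is just an example of a model you have already pulled.

import ollama

# One-shot call: returns the full reply at once.
response = ollama.chat(
    model="phi3",
    messages=[{"role": "user", "content": "Summarize what Ollama does."}],
)
print(response["message"]["content"])

# Streaming call: prints tokens as they arrive, which feels
# more interactive inside a notebook.
stream = ollama.chat(
    model="phi3",
    messages=[{"role": "user", "content": "Write a haiku about local LLMs."}],
    stream=True,
)
for chunk in stream:
    print(chunk["message"]["content"], end="", flush=True)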

Question: Does Ollama have plugins like GitHub Copilot? Can I use them from Visual Studio Code?

Yes, there are many such plugins, for coding and for other purposes, and even for coding alone there are dozens of extensions with different capabilities. And you do not need to pay a monthly fee for them the way you do for GitHub Copilot! Editor extensions that connect VS Code and other editors to a local Ollama server (Continue is one example) are listed under "Community Integrations" in the Ollama GitHub README.

Question: What kind of software are Ollama and LM Studio?

Their role is to facilitate easy use of these models by providing a platform that supports multiple models, offering features like local deployment and experimentation without needing to deal with the complex setup each model otherwise requires.

They are platforms/interfaces for LLMs: Both Ollama and LM Studio are software tools that allow users to interact with, run, and experiment with multiple large language models. They are model management tools or LLM execution environments rather than models themselves.

They are Model Hubs: These tools serve as hubs where you can load, execute, and work with a variety of pre-trained LLMs. Instead of being limited to one specific model, they allow users to work with open models like Llama, Mistral, Gemma, Phi, and others.

They are Model Runners: They enable the running and execution of multiple models.

They are LLM Execution/Management Tools: They manage various models and allow you to deploy them.

You can think of them as infrastructure that abstracts away the complexities of working with different LLMs.

Question: What is LM Studio, and how is it different from Ollama?

Ollama is a CLI- and API-first tool for running pre-trained models locally, well suited to developers who want to script against models or integrate them into applications. LM Studio is a desktop GUI application aimed at users who prefer a point-and-click experience: it lets you browse and download GGUF models from Hugging Face, chat with them, and expose them through an OpenAI-compatible local server. Neither tool trains or fine-tunes models; both run pre-trained weights locally.

Here’s a detailed comparison of Ollama and LM Studio in terms of their capabilities:

| Feature/Capability | Ollama | LM Studio |
|---|---|---|
| Primary purpose | Run and manage LLMs locally via CLI and API | Run and chat with LLMs locally via a desktop GUI |
| Supported models | Models from the Ollama library (Llama, Mistral, Gemma, Phi, Qwen, etc.) | GGUF models downloaded from Hugging Face (Llama, Mistral, etc.) |
| Local model execution | Yes, runs models locally without cloud dependencies | Yes, runs models locally without cloud dependencies |
| Model fine-tuning/training | No, runs pre-trained models | No, runs pre-trained models |
| Interface | Command line plus REST API | Graphical chat interface plus a local server |
| Ease of use | Simple for anyone comfortable with a terminal | Point-and-click, friendly for non-technical users |
| Hardware requirements | Runs on CPU or GPU; detects local hardware automatically | Runs on CPU or GPU; exposes offload settings in the UI |
| Privacy | Strong privacy due to local execution | Strong privacy due to local execution |
| API integration | REST API on port 11434 (OpenAI-compatible endpoints available) | OpenAI-compatible local server (default port 1234) |
| Open source | Yes (MIT-licensed) | No (closed-source desktop app) |
| Model customization | Modelfile-based customization (system prompt, parameters) | Per-model inference settings (system prompt, context length, sampling) |
| Target audience | Developers and anyone scripting or integrating LLMs | Users who prefer a GUI for local LLM experimentation |

Key Capabilities of Ollama:

  1. Run LLMs Locally: Focuses on running pre-trained open models such as Llama, Mistral, Gemma, and Phi on your local machine without requiring cloud dependencies.
  2. Simple Setup: Aimed at developers and non-technical users who want easy access to LLMs.
  3. Privacy & Security: Since models run locally, no data is sent to external servers, enhancing privacy.
  4. API Integration: Provides APIs to integrate models into applications, making it useful for local deployment.
  5. Resource Optimization: Automatically manages local system resources, including CPU and GPU, to run models efficiently.

Key Capabilities of LM Studio:

  1. GUI Model Discovery: Browse and download GGUF models from Hugging Face directly inside the app.
  2. Built-in Chat Interface: Chat with downloaded models without writing any code.
  3. OpenAI-Compatible Server: Serve a loaded model over a local HTTP server that mimics the OpenAI API, so existing OpenAI client code can be pointed at it (see the sketch below).
  4. Inference Configuration: Control settings such as system prompt, context length, sampling parameters, and GPU offload from the UI.
  5. Cross-Platform Desktop App: Available for macOS, Windows, and Linux; unlike Ollama, it is closed-source.
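
Because LM Studio's local server mimics the OpenAI API, existing OpenAI client code can be pointed at it. A minimal sketch using the openai Python package (pip install openai), assuming the server has been started from LM Studio's local-server tab on its default port 1234 (the api_key value is ignored locally but required by the client):

from openai import OpenAI

# Point the standard OpenAI client at LM Studio's local server.
client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")

completion = client.chat.completions.create(
    model="local-model",  # LM Studio serves whichever model is currently loaded
    messages=[{"role": "user", "content": "Hello from LM Studio!"}],
)
print(completion.choices[0].message.content)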

Question: What are the different formats for saving models, specifically LLMs?

Large language models (LLMs) can be stored in various formats, each suited for different purposes and platforms. These formats cater to different needs, from interoperability between frameworks (ONNX) to specific hardware optimizations (OpenVINO, TensorFlow Lite). The choice of format depends on the specific requirements of the deployment environment and the tools being used.

Here are some common model formats used for LLMs:

1. PyTorch (.pt, .pth)

  • Description: Files with .pt or .pth extensions are commonly used to store PyTorch models. These files contain the model’s weights and architecture.
  • Usage: Typically used with PyTorch frameworks for loading and running models.
  • Example: Models saved using torch.save(model.state_dict(), 'model.pth').
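
A minimal sketch of this save/load round trip, using a toy nn.Linear layer as a stand-in for a real model:

import torch
import torch.nn as nn

model = nn.Linear(10, 2)                     # toy stand-in for an LLM
torch.save(model.state_dict(), "model.pth")  # save the weights only

restored = nn.Linear(10, 2)                  # the architecture must be rebuilt in code
restored.load_state_dict(torch.load("model.pth"))
restored.eval()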

2. TensorFlow (.pb, .h5, .tf)

  • Description: TensorFlow models can be saved in multiple formats:
    • .pb (Protocol Buffers): Used for saving the complete model, including weights and architecture.
    • .h5 (HDF5): Used for saving models in Keras (which is a high-level API for TensorFlow).
    • .tf: Used for saving TensorFlow models in the SavedModel format.
  • Usage: Used with TensorFlow for model deployment and inference.
  • Example: Models saved using model.save('model.h5') or tf.saved_model.save(model, 'saved_model').

3. ONNX (.onnx)

  • Description: Open Neural Network Exchange (ONNX) is a format for representing deep learning models. It allows interoperability between different deep learning frameworks.
  • Usage: Enables models trained in one framework (like PyTorch) to be used in another (like TensorFlow).
  • Example: Models converted to ONNX using torch.onnx.export(model, inputs, 'model.onnx').
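
A minimal export sketch, again using a toy model; the dummy input is needed because the exporter traces the model to build the ONNX graph:

import torch
import torch.nn as nn

model = nn.Linear(10, 2)           # toy stand-in for a real model
dummy_input = torch.randn(1, 10)   # example input used to trace the graph
torch.onnx.export(model, dummy_input, "model.onnx")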

4. OpenVINO (.bin, .xml)

  • Description: OpenVINO uses .bin and .xml files to represent optimized models for Intel hardware.
  • Usage: Provides acceleration for inference on Intel devices.
  • Example: Models optimized with OpenVINO are stored in .xml (model structure) and .bin (weights) files.

5. GGUF (.gguf)

  • Description: GGUF (GPT-Generated Unified Format) is a binary format introduced by the llama.cpp project as the successor to GGML. It packs a model’s weights (often quantized) together with its metadata in a single file, providing a standardized way to store and share large language models for local inference.
  • Usage: The standard format for running LLMs with llama.cpp and the tools built on it, including Ollama and LM Studio.
  • Example: Models saved in GGUF format will have the .gguf file extension.

6. SavedModel (directory)

  • Description: TensorFlow’s SavedModel format includes a directory with serialized model weights, graph definitions, and metadata.
  • Usage: TensorFlow’s recommended format for serving models in production.
  • Example: SavedModel format directory includes files like saved_model.pb and a variables directory.

7. Core ML (.mlmodel)

  • Description: Apple’s Core ML format is used for deploying models on iOS, macOS, watchOS, and tvOS.
  • Usage: Used for integrating machine learning models into Apple applications.
  • Example: Models converted to Core ML using tools like coremltools.

8. TensorFlow Lite (.tflite)

  • Description: A format for deploying TensorFlow models on mobile and edge devices. It provides a smaller, more efficient representation of the model.
  • Usage: Optimized for mobile and embedded devices.
  • Example: Models converted to TensorFlow Lite format using tf.lite.TFLiteConverter.

9. Hugging Face (.bin, config.json, tokenizer.json)

  • Description: Hugging Face models typically use .bin files for weights and JSON files for configuration and tokenizers. This format is often associated with the Transformers library.
  • Usage: Used with Hugging Face’s Transformers library for loading and fine-tuning models.
  • Example: Models from Hugging Face’s model hub include .bin files for weights and configuration files.

10. SafeTensors (.safetensors)

  • Description: SafeTensors, developed by Hugging Face, is a format for safely and efficiently storing tensor data, particularly for large language models. It aims to provide secure and consistent handling of model weights.
  • Usage: Designed to improve safety and integrity in model storage by addressing issues related to file corruption and ensuring the integrity of the model data. It’s increasingly used in machine learning and AI communities for its security benefits.
  • Key Features:
    • Safety: Ensures data integrity and helps prevent corruption.
    • Efficiency: Optimized for storage and retrieval of large model weights.
    • Compatibility: Designed to be used with various frameworks and tools that support tensor-based models.
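
A minimal sketch of writing and reading a .safetensors file with the safetensors library (pip install safetensors), using toy tensors:

import torch
from safetensors.torch import save_file, load_file

tensors = {"embedding.weight": torch.zeros((100, 64))}  # toy weights
save_file(tensors, "model.safetensors")    # write tensors to the safe on-disk format
restored = load_file("model.safetensors")  # returns a dict of tensors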

Summary of Model Formats Including SafeTensors:

  1. PyTorch (.pt, .pth)
  2. TensorFlow (.pb, .h5, .tf)
  3. ONNX (.onnx)
  4. Hugging Face (.bin, config.json, tokenizer.json)
  5. GGUF (.gguf)
  6. SavedModel (directory)
  7. Core ML (.mlmodel)
  8. TensorFlow Lite (.tflite)
  9. OpenVINO (.bin, .xml)
  10. SafeTensors (.safetensors)

Question: What is the GGUF model extension?

GGUF (GPT-Generated Unified Format) is a binary file format introduced by the llama.cpp project to provide a standardized way of storing and sharing large language models for local inference. It succeeded the earlier GGML format and packs model weights (often quantized) together with tokenizer and configuration metadata in a single file, so a model can be loaded and utilized easily across different environments. It is the format consumed by llama.cpp-based tools such as Ollama and LM Studio.

The GGUF format aims at:

  • Standardization: GGUF standardizes how model data is stored and exchanged, making it easier to move models between llama.cpp-based tools.
  • Efficiency: The format is designed to efficiently handle the large size of modern language models, ensuring that models can be loaded and processed quickly.

Question: If I have fine-tuned my models on cloud platforms like AWS SageMaker, Vertex AI, or Azure and kept them there, can I use them inside Ollama and LM Studio?

Yes, you can, in a couple of ways.

Method 1: API integration

  • Obtain the endpoint URL and API key for your deployed model from the cloud platform (SageMaker, Vertex AI, or Azure ML).
  • Prepare your environment for making HTTP requests.
  • Send requests to the API endpoint using tools like Python’s requests library (a hedged sketch follows this list).
  • Integrate the API calls into LM Studio or other tools.
  • Test and validate the integration to ensure it functions correctly.
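
The endpoint URL, key, and request schema below are placeholders; each provider (SageMaker, Vertex AI, Azure ML) has its own invocation URL format and payload shape, so check its documentation.

import requests

ENDPOINT_URL = "https://your-endpoint.example.com/v1/predict"  # placeholder, from your cloud console
API_KEY = "your-api-key"                                       # placeholder

resp = requests.post(
    ENDPOINT_URL,
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={"inputs": "Hello from my fine-tuned model"},  # payload schema varies by provider
)
print(resp.json())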

Method 2: Model conversion and export

  • Export Models: Export the model weights from the cloud service (for example, as a Hugging Face-style checkpoint with safetensors weights). This might involve transferring the model files to your machine.
  • Convert to GGUF: Convert the exported weights to GGUF using llama.cpp’s conversion tooling, since GGUF is the format Ollama and LM Studio consume.
  • Import into Ollama: Point a Modelfile at the .gguf file and create the model, as sketched below.
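
For example, a model fine-tuned with Hugging Face Transformers can typically be converted and imported like this. The script name reflects recent llama.cpp versions; check that repository for the exact entry point.

# Convert Hugging Face weights to GGUF using llama.cpp's converter
python convert_hf_to_gguf.py ./my-finetuned-model --outfile my-model.gguf

# Modelfile
FROM ./my-model.gguf

# Create and run the model in Ollama
ollama create myfinetune -f ./Modelfile
ollama run myfinetune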

References

  • https://ollama.com
  • https://github.com/ollama/ollama
