Model creator: Mistral AI
Original model: Mistral 7B Instruct v0.2
This repository contains GPTQ model files for Mistral AI's Mistral 7B Instruct v0.2. Multiple GPTQ parameter permutations are provided; see the table below for details of each option, its parameters, and the software used to create it. These files were quantised using hardware kindly provided by Massed Compute.
Explanation of GPTQ Parameters:
| Branch | Bits | GS | Act Order | Damp % | GPTQ Dataset | Seq Len | Size | ExLlama | Desc |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| main | 4 | 128 | Yes | 0.1 | VMware Open Instruct | 4096 | 4.16 GB | Yes | 4-bit, with Act Order and group size 128g. Uses less VRAM than 64g, but with slightly lower accuracy. |
| gptq-4bit-32g-actorder_True | 4 | 32 | Yes | 0.1 | VMware Open Instruct | 4096 | 4.57 GB | Yes | 4-bit, with Act Order and group size 32g. Highest possible inference quality, with maximum VRAM usage. |
| gptq-8bit--1g-actorder_True | 8 | None | Yes | 0.1 | VMware Open Instruct | 4096 | 7.52 GB | No | 8-bit, with Act Order. No group size, to lower VRAM requirements. |
| gptq-8bit-128g-actorder_True | 8 | 128 | Yes | 0.1 | VMware Open Instruct | 4096 | 7.68 GB | No | 8-bit, with group size 128g for higher inference quality and with Act Order for even higher accuracy. |
| gptq-8bit-32g-actorder_True | 8 | 32 | Yes | 0.1 | VMware Open Instruct | 4096 | 8.17 GB | No | 8-bit, with group size 32g and Act Order for maximum inference quality. |
| gptq-4bit-64g-actorder_True | 4 | 64 | Yes | 0.1 | VMware Open Instruct | 4096 | 4.29 GB | Yes | 4-bit, with Act Order and group size 64g. Uses less VRAM than 32g, but with slightly lower accuracy. |
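As a minimal sketch (not from this repository's documentation), the branches above can be loaded with Transformers by passing the branch name as `revision`. The repository id below is an assumption based on the usual naming convention; substitute the actual repo name. Loading GPTQ checkpoints this way additionally requires the optimum and auto-gptq packages.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical repo id -- substitute the actual repository name.
model_id = "TheBloke/Mistral-7B-Instruct-v0.2-GPTQ"

# `revision` selects a branch from the table above; omit it (or use
# "main") for the default 4-bit / 128g / Act Order files.
branch = "gptq-4bit-32g-actorder_True"
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    revision=branch,
)
tokenizer = AutoTokenizer.from_pretrained(model_id, revision=branch)
```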
Mistral 7B Instruct v0.2 models are optimized for a wide range of natural language understanding and generation tasks. The GPTQ quantisation enhances their efficiency, making them suitable for deployment on consumer GPUs and other memory-constrained hardware.
The model's architecture and training make it well-suited for tasks requiring understanding and generating human-like text, especially in interactive and real-time applications.
Mistral 7B Instruct v0.2 is a decoder-only model based on an optimized transformer architecture. It uses Grouped-Query Attention (GQA) for faster inference; unlike v0.1, v0.2 drops Sliding-Window Attention in favour of a full 32k-token context window. The model also employs a Byte-fallback BPE tokenizer, making it efficient for various text generation tasks. The model has been fine-tuned for instruction following, making it well-suited to generating informative and contextually relevant responses.
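Because the model is instruction-tuned, prompts should follow Mistral's `[INST] ... [/INST]` chat format. A minimal sketch using the tokenizer's built-in chat template, continuing from the loading sketch above:

```python
# Assumes `model` and `tokenizer` from the loading sketch above.
messages = [{"role": "user", "content": "Explain GPTQ in one sentence."}]

# apply_chat_template wraps the message in Mistral's [INST] ... [/INST] format.
input_ids = tokenizer.apply_chat_template(messages, return_tensors="pt").to(model.device)
output = model.generate(input_ids, max_new_tokens=128)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```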
GPTQ (a post-training quantization method for Generative Pre-trained Transformers) is a technique that optimizes large language models by reducing the numerical precision of their weights. This quantization decreases the model size and the computational resources required for inference, enabling faster and more efficient deployment without significantly compromising the model's accuracy. GPTQ exposes multiple quantization parameters (bits, group size, activation order, damping), allowing users to choose the best configuration for their hardware capabilities and specific needs.
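To illustrate those parameter options, here is a sketch of quantising the base model with Transformers' `GPTQConfig` (requires the optimum and auto-gptq packages). The `bits`, `group_size`, `desc_act`, and `damp_percent` values mirror the table columns above; `"c4"` is a stand-in calibration dataset, since the files in this repository were actually calibrated on VMware Open Instruct:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

base_id = "mistralai/Mistral-7B-Instruct-v0.2"
tokenizer = AutoTokenizer.from_pretrained(base_id)

# Bits / GS / Act Order / Damp % map onto the table columns above.
# "c4" is a placeholder calibration set, not the one used for this repo.
quant_config = GPTQConfig(
    bits=4,
    group_size=128,
    desc_act=True,
    damp_percent=0.1,
    dataset="c4",
    tokenizer=tokenizer,
)

model = AutoModelForCausalLM.from_pretrained(
    base_id,
    quantization_config=quant_config,
    device_map="auto",
)
model.save_pretrained("mistral-7b-instruct-v0.2-gptq-4bit-128g")
```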
This model was quantised by TheBloke.