TheBloke/Llama-2-13B-chat-GPTQ

Llama 2 13B Chat - GPTQ

Model creator: Meta
Original model: Llama 2 13B Chat

Description

This repository contains GPTQ model files for Meta's Llama 2 13B Chat. Multiple GPTQ parameter permutations are provided; see the table below for details of each option, its parameters, and the software used to create it.

GPTQ

GPTQ is a post-training quantisation technique that reduces the precision of a model's weights (here, to 4 or 8 bits). Quantisation shrinks the model's size on disk and in VRAM and lowers the computational resources required for inference, enabling faster and more efficient deployment without significantly compromising the model's accuracy. By providing multiple quantisation parameter options, this repository lets users choose the configuration that best fits their hardware and specific needs.
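
As an example, here is a minimal sketch of loading one of these quantisations with the Hugging Face transformers library. It assumes a transformers version with GPTQ support plus the optimum and auto-gptq packages installed, and a CUDA GPU with enough VRAM for the chosen branch; exact package requirements may differ for your setup:

```python
# A minimal sketch of loading this repository with transformers.
# Assumes: pip install transformers optimum auto-gptq (versions that
# support GPTQ loading), plus a CUDA GPU with enough VRAM for the branch.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "TheBloke/Llama-2-13B-chat-GPTQ"

tokenizer = AutoTokenizer.from_pretrained(model_id)

# revision selects which quantisation branch to use (see the table below);
# device_map="auto" places the quantised weights on the available GPU(s).
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    revision="main",
)
```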

Provided Files and GPTQ Parameters

Multiple quantisation parameters are provided to allow you to choose the best one for your hardware and requirements.

Explanation of GPTQ Parameters:

  • Bits: The bit width of the quantised weights.
  • GS: GPTQ group size. Higher numbers use less VRAM but give slightly lower quantisation accuracy; "None" is the lowest possible value.
  • Act Order: Also known as desc_act. True generally results in better quantisation accuracy.
  • Damp %: A GPTQ parameter that affects how samples are processed during quantisation; 0.01 is the default.
  • GPTQ Dataset: The calibration dataset used for quantisation.
  • Seq Len: The sequence length of the calibration samples.
  • Size: The download size of the model files for that branch.
  • ExLlama: Whether the files can be loaded with the ExLlama loader, which currently supports only 4-bit Llama models.

| Branch | Bits | GS | Act Order | Damp % | GPTQ Dataset | Seq Len | Size | ExLlama | Desc |
| ------ | ---- | -- | --------- | ------ | ------------ | ------- | ---- | ------- | ---- |
| main | 4 | 128 | No | 0.01 | wikitext | 4096 | 7.26 GB | Yes | 4-bit, without Act Order and group size 128g. |
| gptq-4bit-32g-actorder_True | 4 | 32 | Yes | 0.01 | wikitext | 4096 | 8.00 GB | Yes | 4-bit, with Act Order and group size 32g. Highest inference quality. |
| gptq-4bit-64g-actorder_True | 4 | 64 | Yes | 0.01 | wikitext | 4096 | 7.51 GB | Yes | 4-bit, with Act Order and group size 64g. Less VRAM, slightly lower accuracy. |
| gptq-4bit-128g-actorder_True | 4 | 128 | Yes | 0.01 | wikitext | 4096 | 7.26 GB | Yes | 4-bit, with Act Order and group size 128g. Less VRAM, slightly lower accuracy. |
| gptq-8bit-128g-actorder_True | 8 | 128 | Yes | 0.01 | wikitext | 4096 | 13.65 GB | No | 8-bit, with group size 128g for higher inference quality and Act Order for even higher accuracy. |
| gptq-8bit-64g-actorder_True | 8 | 64 | Yes | 0.01 | wikitext | 4096 | 13.95 GB | No | 8-bit, with group size 64g and Act Order for even higher inference quality. |
| gptq-8bit-128g-actorder_False | 8 | 128 | No | 0.01 | wikitext | 4096 | 13.65 GB | No | 8-bit, with group size 128g for higher inference quality, without Act Order to improve AutoGPTQ speed. |
| gptq-8bit--1g-actorder_True | 8 | None | Yes | 0.01 | wikitext | 4096 | 13.36 GB | No | 8-bit, with Act Order. No group size, to lower VRAM requirements. |
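
Each branch in the table is a git revision of this repository, so a specific quantisation can be fetched on its own. Here is a minimal sketch using huggingface_hub; the branch name comes from the table above, while the local directory name is an arbitrary example:

```python
# A minimal sketch of downloading a single quantisation branch with
# huggingface_hub; the branch name comes from the table above, while
# the local directory name is an arbitrary example.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="TheBloke/Llama-2-13B-chat-GPTQ",
    revision="gptq-4bit-32g-actorder_True",  # any branch from the table
    local_dir="llama-2-13b-chat-gptq-4bit-32g",
)
```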

Best Use Cases for Llama 2 13B Chat Model

Llama 2 models are decoder-only transformers that excel at generating coherent, contextually relevant text. The 13B Chat variant is fine-tuned specifically for dialogue and conversational AI applications. Key use cases include:

  • Customer Support: Automated responses to common customer queries.
  • Virtual Assistants: Providing helpful and contextually aware information.
  • Content Creation: Assisting in drafting articles, emails, and other text content.
  • Educational Tools: Offering tutoring and answering questions in educational platforms.

The model's architecture and training make it well suited to tasks that require understanding and generating human-like text, especially in interactive and real-time applications. In Meta's evaluations, the fine-tuned chat models outperform many open-source chat models and are competitive with some popular closed-source models in terms of helpfulness and safety.
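
As a sketch of what such an interactive exchange looks like in practice, the snippet below builds a single-turn prompt in the Llama 2 chat template (the [INST] / <<SYS>> markers) and generates a reply. It reuses the model and tokenizer from the loading example above; the system prompt, question, and sampling settings are illustrative choices, not values defined by this repository:

```python
# A minimal sketch of a single-turn exchange using the Llama 2 chat
# prompt template. Reuses `model` and `tokenizer` from the loading
# example above; the system prompt, question, and sampling settings
# are illustrative choices, not values defined by this repository.
system_prompt = "You are a helpful, respectful and honest assistant."
user_message = "How do I reset a forgotten email password?"

# Llama 2 chat models were trained with this [INST] / <<SYS>> format.
prompt = f"[INST] <<SYS>>\n{system_prompt}\n<</SYS>>\n\n{user_message} [/INST]"

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output_ids = model.generate(
    **inputs,
    max_new_tokens=256,
    do_sample=True,
    temperature=0.7,
)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```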

Credit

This model was quantised by TheBloke.