Quantization schemes comparison

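Q5_0 (5-bit block quantization, scale only)

  • Precision: 5 bits per weight, plus one 16-bit scale per block of 32 weights (about 5.5 bits per weight in total)
  • Range: stored codes 0 to 31, interpreted as -16 to 15 after subtracting 16
  • Conversion formula:
    • w ≈ d · (q - 16), where q is the 5-bit code and d is the block scale
  • Characteristics:
    • Simple and fast, but the grid is symmetric around zero, so blocks whose values are skewed to one side waste part of the range and see larger rounding errors.
    • Suitable for applications where some loss of precision is acceptable.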
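
The scale-only scheme can be sketched in a few lines. This is a simplified illustration loosely modeled on the GGML reference quantizer, not the library's code: the function names are hypothetical, the scale is kept as a Python float rather than a 16-bit float, and the codes are left unpacked.

```python
def quantize_q5_0(block):
    """Quantize one block of 32 floats to 5-bit codes plus one scale (sketch)."""
    assert len(block) == 32
    # Value with the largest magnitude, sign preserved.
    vmax = max(block, key=abs)
    # Choose the scale so that vmax lands on code 0, i.e. on -16 after the offset.
    d = vmax / -16.0
    inv = 1.0 / d if d != 0.0 else 0.0
    # Shift by 16, round to nearest, and clamp to the 5-bit range 0..31.
    codes = [min(31, int(x * inv + 16.5)) for x in block]
    return d, codes


def dequantize_q5_0(d, codes):
    """Reconstruct approximate weights: w = d * (q - 16)."""
    return [d * (q - 16) for q in codes]
```

Because the grid has 16 negative steps but only 15 positive ones, the value opposite in sign to vmax can be clamped, so the worst-case error in this sketch is one full step |d| rather than half a step.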

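Q5_1 (5-bit block quantization, scale and minimum)

  • Precision: 5 bits per weight, plus a 16-bit scale and a 16-bit minimum per block of 32 weights (about 6 bits per weight in total)
  • Range: stored codes 0 to 31 (unsigned)
  • Conversion formula:
    • w ≈ d · q + m, where q is the 5-bit code, d is the block scale, and m is the block minimum
  • Characteristics:
    • Similar to Q5_0, but each block also stores an offset (its minimum), so the 32 levels cover exactly the range of values in the block.
    • Usually provides better accuracy than Q5_0, at the cost of slightly more storage.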
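
A minimal sketch of the scale-plus-minimum variant, under the same assumptions as any illustration here: hypothetical function names, Python floats instead of 16-bit floats, and no bit packing.

```python
def quantize_q5_1(block):
    """Quantize one block of 32 floats to 5-bit codes, a scale, and a minimum."""
    assert len(block) == 32
    lo, hi = min(block), max(block)
    # Spread the 32 levels (codes 0..31) evenly between lo and hi.
    d = (hi - lo) / 31.0
    inv = 1.0 / d if d != 0.0 else 0.0
    # Round to the nearest level; min() guards against floating-point overshoot.
    codes = [min(31, int((x - lo) * inv + 0.5)) for x in block]
    return d, lo, codes


def dequantize_q5_1(d, m, codes):
    """Reconstruct approximate weights: w = d * q + m."""
    return [d * q + m for q in codes]
```

For a block whose values all sit on one side of zero, the step d here depends only on the spread (hi - lo), which is why the offset tends to reduce rounding error compared with a scale-only grid.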

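Q5_K_S (5-bit k-quant, "small" variant)

  • Precision: 5 bits per weight; about 5.5 bits per weight once scales and minimums are included
  • Range: stored codes 0 to 31 (unsigned)
  • Conversion formula:
    • w ≈ d · sc · q - dmin · mn, where q is the 5-bit code, sc and mn are the 6-bit scale and minimum of a 32-weight sub-block, and d and dmin are 16-bit factors shared by a super-block of 256 weights
  • Characteristics:
    • The "K" refers to the k-quant super-block layout and the "S" to the small variant; neither has anything to do with the sign bit.
    • As defined in GGML/llama.cpp, the small variant stores the quantized tensors in the 5-bit k-quant format (Q5_K) throughout, which makes it the smallest of the 5-bit k-quant options.
    • Suitable when size matters slightly more than the last bit of accuracy.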
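
To make the two-level scaling of the 5-bit k-quant tensor format (Q5_K in GGML) concrete, here is a minimal dequantization sketch. It assumes the codes, sub-block scales, and sub-block minimums have already been unpacked into plain Python lists; the real format packs them into bit fields, and the function names are hypothetical.

```python
def dequantize_q5_k_subblock(d, dmin, sc, mn, codes):
    """Dequantize one 32-weight sub-block: w = d * sc * q - dmin * mn."""
    assert 0 <= sc < 64 and 0 <= mn < 64  # sub-block scale and min are 6-bit
    scale = d * sc      # effective step size for this sub-block
    offset = dmin * mn  # effective minimum for this sub-block
    return [scale * q - offset for q in codes]


def dequantize_q5_k_superblock(d, dmin, scales, mins, codes):
    """Dequantize a super-block of 256 weights made of 8 sub-blocks of 32."""
    assert len(scales) == len(mins) == 8 and len(codes) == 256
    out = []
    for j in range(8):
        sub = codes[32 * j:32 * (j + 1)]
        out.extend(dequantize_q5_k_subblock(d, dmin, scales[j], mins[j], sub))
    return out
```

Storing only small integers per sub-block and two floating-point factors per super-block is what keeps the overhead near half a bit per weight.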

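Q5_K_M (5-bit k-quant, "medium" variant)

  • Precision: 5 bits per weight for most tensors, 6 bits for a few selected tensors; slightly above 5.5 bits per weight overall
  • Range: stored codes 0 to 31 for the 5-bit tensors
  • Conversion formula:
    • For the 5-bit tensors: w ≈ d · sc · q - dmin · mn, where q is the 5-bit code, sc and mn are the 6-bit sub-block scale and minimum, and d and dmin are 16-bit super-block factors
  • Characteristics:
    • The "M" stands for medium, not K-means; no clustering is involved.
    • Uses the same 5-bit k-quant format as Q5_K_S for most tensors, but keeps some of the more sensitive tensors in Q6_K (in llama.cpp's original k-quant mixes, part of the attention value and feed-forward output tensors).
    • Provides somewhat better accuracy than Q5_K_S at a slightly larger size.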

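Q6_K (6-bit k-quant)

  • Precision: 6 bits per weight; about 6.56 bits per weight once scales are included
  • Range: -32 to 31
  • Conversion formula:
    • w ≈ d · sc · q, where q is the signed 6-bit code, sc is the 8-bit scale of a 16-weight sub-block, and d is a 16-bit scale shared by a super-block of 256 weights
  • Characteristics:
    • Uses the same super-block idea as the 5-bit k-quants, but with one more bit per weight, smaller sub-blocks, and no per-block minimum.
    • Provides better accuracy than Q5_K_S and Q5_K_M, at a larger size.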
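
A short sketch of how a 6-bit code can be reassembled and scaled. It assumes, as in GGML's Q6_K layout, that each code is stored as a 4-bit low part and a 2-bit high part and re-centered by subtracting 32; the function names are hypothetical and the surrounding bit unpacking is omitted.

```python
def q6_code(low4, high2):
    """Combine a 4-bit low part and a 2-bit high part into a signed 6-bit code."""
    assert 0 <= low4 < 16 and 0 <= high2 < 4
    # The unsigned value 0..63 is shifted to the signed range -32..31.
    return (low4 | (high2 << 4)) - 32


def dequantize_q6_k_subblock(d, sc, codes):
    """Dequantize one 16-weight sub-block: w = d * sc * q."""
    assert -128 <= sc <= 127  # sub-block scale is a signed 8-bit integer
    return [d * sc * q for q in codes]
```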

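Q8_0 (8-bit block quantization, scale only)

  • Precision: 8 bits per weight, plus one 16-bit scale per block of 32 weights (about 8.5 bits per weight in total)
  • Range: signed 8-bit codes; in practice -127 to 127, since the scale is chosen from the largest magnitude in the block
  • Conversion formula:
    • w ≈ d · q, where q is the signed 8-bit code and d is the block scale
  • Characteristics:
    • Simple and fast, with small rounding errors; results are typically very close to FP16.
    • The largest of the quantized formats listed here, so it is suitable when accuracy matters more than size.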
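
A minimal round-trip sketch of 8-bit scale-only block quantization. The function names are hypothetical, the scale is a Python float rather than a 16-bit float, and Python's `round` (which rounds ties to even) stands in for whatever rounding a real implementation uses.

```python
def quantize_q8_0(block):
    """Quantize one block of 32 floats to signed 8-bit codes plus one scale."""
    assert len(block) == 32
    amax = max(abs(x) for x in block)
    # The largest magnitude in the block maps to code +/-127.
    d = amax / 127.0
    inv = 1.0 / d if d != 0.0 else 0.0
    codes = [round(x * inv) for x in block]
    return d, codes


def dequantize_q8_0(d, codes):
    """Reconstruct approximate weights: w = d * q."""
    return [d * q for q in codes]
```

With 255 usable levels per block, the rounding error per weight is at most half a step, d / 2, which is a small fraction of the block's largest value.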

Comparison summary:

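Scheme | Precision (bits per weight) | Range of codes | Conversion formula
Q5_0 | 5 (about 5.5 with scale) | 0 to 31, offset by 16 | w ≈ d · (q - 16)
Q5_1 | 5 (about 6 with scale and minimum) | 0 to 31 | w ≈ d · q + m
Q5_K_S | 5 (about 5.5 with scales) | 0 to 31 | w ≈ d · sc · q - dmin · mn
Q5_K_M | 5, with some tensors at 6 (slightly above 5.5) | 0 to 31 | w ≈ d · sc · q - dmin · mn
Q6_K | 6 (about 6.56 with scales) | -32 to 31 | w ≈ d · sc · q
Q8_0 | 8 (about 8.5 with scale) | -127 to 127 | w ≈ d · q

Here q is the stored integer code, d and dmin are 16-bit floating-point factors, m is a per-block minimum, and sc and mn are small integer scales and minimums stored per sub-block.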

When choosing between these schemes, consider the following factors:

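  • Accuracy: Q8_0 and Q6_K typically stay closest to the original weights, followed by Q5_K_M; Q5_K_S and Q5_1 are usually close to each other, and Q5_0 is typically the least accurate of the six.
  • Memory efficiency: Q5_0 and Q5_K_S are the smallest (about 5.5 bits per weight), followed by Q5_K_M, Q5_1 (about 6), Q6_K (about 6.56), and Q8_0 (about 8.5).
  • Complexity: the k-quants (Q5_K_S, Q5_K_M, Q6_K) need slightly more work to decode because of their two-level scales; Q5_0, Q5_1, and Q8_0 use a single scale (and optional minimum) per block.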
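
The storage cost of each format follows from its block layout. The sketch below works through the arithmetic, assuming the GGML layouts: blocks of 32 weights with 16-bit scales for Q5_0, Q5_1, and Q8_0, and super-blocks of 256 weights for the 5-bit and 6-bit k-quants. The `LAYOUTS` table and function name are illustrative.

```python
def bits_per_weight(weights, code_bits, extra_bits):
    """Average storage per weight: code bits plus the block's shared overhead."""
    return (weights * code_bits + extra_bits) / weights


# (weights per block, bits per code, overhead bits per block)
LAYOUTS = {
    "Q5_0": (32, 5, 16),                    # one 16-bit scale
    "Q5_1": (32, 5, 16 + 16),               # 16-bit scale + 16-bit minimum
    "Q5_K": (256, 5, 16 + 16 + 8 * (6 + 6)),  # d, dmin, 8 x (6-bit scale + 6-bit min)
    "Q6_K": (256, 6, 16 + 16 * 8),          # d, 16 x 8-bit sub-block scales
    "Q8_0": (32, 8, 16),                    # one 16-bit scale
}

for name, layout in LAYOUTS.items():
    print(f"{name}: {bits_per_weight(*layout)} bits per weight")
```

Q5_K_S and Q5_K_M are mixes built mostly from Q5_K tensors, so their averages sit at or slightly above the Q5_K figure.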

Ultimately, the choice of quantization scheme depends on the specific application requirements and trade-offs between accuracy, memory efficiency, and computational complexity.

FP16 ?

Here are some general guidelines:

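  • Use FP16 when:
    • The original, unquantized precision is required (for example, as a reference for evaluating quantized versions).
    • Memory usage is not a concern (at 16 bits per weight, FP16 needs roughly twice the memory of Q8_0 and about 2.4 times that of Q6_K).
  • Use Q6_K when:
    • A good balance between accuracy and memory efficiency is desired.
    • The slightly more involved decoding of k-quants is not a major concern.
  • Use Q8_0 when:
    • Accuracy very close to FP16 is wanted, with simple and fast decoding.
    • Memory usage is a concern relative to FP16, but not tight enough to require Q6_K (Q8_0 requires less memory than FP16 but more than Q6_K).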
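
To put the memory trade-off in numbers, the sketch below estimates the weight storage for a given parameter count. It assumes 16, 8.5, and 6.5625 bits per weight for FP16, Q8_0, and Q6_K, and it ignores metadata and any tensors kept in a different type, so real files will differ somewhat. The 7-billion parameter count is a hypothetical example.

```python
# Assumed average storage cost per weight, in bits.
BITS_PER_WEIGHT = {"FP16": 16.0, "Q8_0": 8.5, "Q6_K": 6.5625}


def estimated_size_gib(n_params, scheme):
    """Rough size of the weights alone, in GiB."""
    return n_params * BITS_PER_WEIGHT[scheme] / 8 / 2**30


n_params = 7_000_000_000  # hypothetical model size
for scheme in ("FP16", "Q8_0", "Q6_K"):
    print(f"{scheme}: {estimated_size_gib(n_params, scheme):.1f} GiB")
```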