Understanding Transformers

Post by **Antonio Linares** » Sat Jul 06, 2024 10:10 am

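Transformers were introduced in the paper "Attention Is All You Need" (Vaswani et al., 2017) and are the architecture behind large language models such as ChatGPT. Here is a minimal single-head self-attention example in Harbour, using random embeddings and weights: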

FUNCTION Main()

    LOCAL aEmbeddings, aWq, aWk, aWv, aBq, aBk, aBv
    LOCAL aQ, aK, aV
    LOCAL aAttentionScores, aOutput

    // Simulate embeddings and weight matrices (normally these would be loaded or generated)
    aEmbeddings := GenerateRandomMatrix(3, 4)  // [seq_len, d_model]
    aWq := GenerateRandomMatrix(4, 2)  // [d_model, d_k]
    aWk := GenerateRandomMatrix(4, 2)
    aWv := GenerateRandomMatrix(4, 2)
    aBq := GenerateRandomVector(2)  // [d_k]
    aBk := GenerateRandomVector(2)
    aBv := GenerateRandomVector(2)
    
    ? aEmbeddings
    
    // Apply the linear transformations
    aQ := LinearTransformation(aEmbeddings, aWq, aBq)
    aK := LinearTransformation(aEmbeddings, aWk, aBk)
    aV := LinearTransformation(aEmbeddings, aWv, aBv)
    
    // Compute the attention scores
    aAttentionScores := CalculateAttentionScores(aQ, aK)
    
    // Apply the attention scores to the values
    aOutput := ApplyAttention(aAttentionScores, aV)
    
    // Print the results
    ? "Query:", aQ
    ? "Key:", aK
    ? "Value:", aV
    ? "Attention Scores:", aAttentionScores
    ? "Output:", aOutput

RETURN NIL

FUNCTION LinearTransformation(aX, aW, aB)
    LOCAL aResult, i, j, k, nSum
    LOCAL nRows := Len(aX), nCols := Len(aW[1]), nInner := Len(aW)
    
    aResult := Array(nRows)
    FOR i := 1 TO nRows
        aResult[i] := Array(nCols)
        FOR j := 1 TO nCols
            nSum := 0
            FOR k := 1 TO nInner
                nSum += aX[i][k] * aW[k][j]
            NEXT
            aResult[i][j] := nSum + aB[j]
        NEXT
    NEXT
    
RETURN aResult

FUNCTION GenerateRandomMatrix(nRows, nCols)
    LOCAL aMatrix := Array(nRows, nCols), i, j
    FOR i := 1 TO nRows
        FOR j := 1 TO nCols
            aMatrix[i,j] := hb_Random(-1, 1)
        NEXT    
    NEXT
RETURN aMatrix

FUNCTION GenerateRandomVector(nSize)
    LOCAL aVector := Array(nSize), i
    FOR i := 1 TO nSize
        aVector[i] := hb_Random(-1, 1)
    NEXT
RETURN aVector

FUNCTION CalculateAttentionScores(aQ, aK)
    LOCAL aScores, i, j, k, nSum, nExpSum
    LOCAL nRowsQ := Len(aQ), nColsQ := Len(aQ[1])
    LOCAL nRowsK := Len(aK), nColsK := Len(aK[1])
    
    // aQ and aK must have the same number of columns (d_k)
    IF nColsQ <> nColsK
        ? "Error: the dimensions of aQ and aK do not match"
        RETURN NIL
    ENDIF
    
    aScores := Array(nRowsQ, nRowsK)
    FOR i := 1 TO nRowsQ
        FOR j := 1 TO nRowsK
            nSum := 0
            FOR k := 1 TO nColsQ
                nSum += aQ[i][k] * aK[j][k]
            NEXT
            aScores[i][j] := nSum / Sqrt(nColsQ)  // Scale the attention scores by sqrt(d_k)
        NEXT
    NEXT
    
    // Apply softmax normalization to each row
    FOR i := 1 TO nRowsQ
        nExpSum := 0
        FOR j := 1 TO nRowsK
            aScores[i][j] := Exp(aScores[i][j])
            nExpSum += aScores[i][j]
        NEXT
        FOR j := 1 TO nRowsK
            aScores[i][j] /= nExpSum
        NEXT
    NEXT
    
RETURN aScores

FUNCTION ApplyAttention(aScores, aV)
    LOCAL aOutput, i, j, k, nSum
    LOCAL nRows := Len(aScores), nCols := Len(aV[1]), nInner := Len(aV)
    
    aOutput := Array(nRows, nCols)
    FOR i := 1 TO nRows
        FOR j := 1 TO nCols
            nSum := 0
            FOR k := 1 TO nInner
                nSum += aScores[i][k] * aV[k][j]
            NEXT
            aOutput[i][j] := nSum
        NEXT
    NEXT
    
RETURN aOutput

{{-0.20, -0.33, -0.13, 0.75}, {0.56, 0.31, 0.19, -0.09}, {-0.26, 0.48, 0.73, -0.32}}
Query: {{0.6859, -0.0584}, {1.3492, 0.9291}, {1.0082, 1.1412}}
Key: {{0.3594, 1.1780}, {1.0069, 1.3886}, {0.8579, 0.6985}}
Value: {{-0.2781, -0.6665}, {-1.0988, 0.3276}, {-0.3004, 0.3100}}
Attention Scores: {{0.27, 0.37, 0.36}, {0.23, 0.49, 0.27}, {0.26, 0.49, 0.25}}
Output: {{-0.590643, 0.049439}, {-0.690302, 0.091827}, {-0.684619, 0.064918}}
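
For reference, here is a sketch of the math the listing above computes, written in LaTeX. The symbols are my own shorthand, not names from the code: X stands for aEmbeddings, W_Q and b_Q for aWq and aBq (likewise for K and V), and d_k = 2 is the number of columns of aQ. The last two lines check the first row of the printed attention scores by hand, using the rounded Query and Key values shown above, so the figures are approximate.

```latex
% Linear projections (LinearTransformation)
\[
Q = X W_Q + b_Q, \qquad K = X W_K + b_K, \qquad V = X W_V + b_V
\]

% Scaled dot-product attention (CalculateAttentionScores + ApplyAttention)
\[
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V
\]

% Hand check of row 1, with q_1 = (0.6859, -0.0584) and d_k = 2
\[
\frac{q_1 \cdot k_1}{\sqrt{2}} \approx 0.126, \qquad
\frac{q_1 \cdot k_2}{\sqrt{2}} \approx 0.431, \qquad
\frac{q_1 \cdot k_3}{\sqrt{2}} \approx 0.387
\]
\[
\mathrm{softmax}(0.126,\ 0.431,\ 0.387) \approx (0.27,\ 0.37,\ 0.36)
\]
```

This agrees with the first row of "Attention Scores" in the printout, and each row of that matrix sums to roughly 1, as expected from the softmax.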

Post by **Antonio Linares** » Sat Jul 06, 2024 10:51 am

Adding multi-head attention, positional encoding, and a two-layer encoder:

FUNCTION Main()

    LOCAL aInput, aPositionalEncoding, aEncoderOutput
    LOCAL nBatchSize := 2, nSeqLen := 5, nModelDim := 8, nHeads := 2

    // Generate sample input
    aInput := GenerateRandomMatrix(nBatchSize, nSeqLen, nModelDim)

    // Generate the positional encoding
    aPositionalEncoding := GeneratePositionalEncoding(nSeqLen, nModelDim)

    // Add the positional encoding to the input
    aInput := AddPositionalEncoding(aInput, aPositionalEncoding)

    // Create and apply the encoder
    aEncoderOutput := TransformerEncoder(aInput, nHeads, 2) // 2 encoder layers

    ? "Input con codificación posicional:"
    PrintMatrix(aInput)
    ? "Salida del Encoder:"
    PrintMatrix(aEncoderOutput)

RETURN NIL

FUNCTION TransformerEncoder(aInput, nHeads, nLayers)
    LOCAL aOutput := aInput
    LOCAL i

    FOR i := 1 TO nLayers
        // Multi-Head Attention
        aOutput := AddAndNorm(aOutput, MultiHeadAttention(aOutput, nHeads))
        
        // Feed Forward
        aOutput := AddAndNorm(aOutput, FeedForward(aOutput))
    NEXT

RETURN aOutput

FUNCTION MultiHeadAttention(aInput, nHeads)
    LOCAL aOutputs := {}, aFinalOutput, i
    LOCAL nBatchSize := Len(aInput), nSeqLen := Len(aInput[1]), nModelDim := Len(aInput[1, 1])
    LOCAL nHeadDim := Int(nModelDim / nHeads)
    LOCAL aWq, aWk, aWv, aQ, aK, aV, aAttentionScores, aHeadOutput, aWo
    
    FOR i := 1 TO nHeads
        aWq := GenerateRandomMatrix(nModelDim, nHeadDim)
        aWk := GenerateRandomMatrix(nModelDim, nHeadDim)
        aWv := GenerateRandomMatrix(nModelDim, nHeadDim)
        
        aQ := LinearTransformation(aInput, aWq)
        aK := LinearTransformation(aInput, aWk)
        aV := LinearTransformation(aInput, aWv)
        
        aAttentionScores := CalculateAttentionScores(aQ, aK)
        aHeadOutput := ApplyAttention(aAttentionScores, aV)
        
        AAdd(aOutputs, aHeadOutput)
    NEXT
    
    aFinalOutput := ConcatenateOutputs(aOutputs)
    
    aWo := GenerateRandomMatrix(nModelDim, nModelDim)
    aFinalOutput := LinearTransformation(aFinalOutput, aWo)

RETURN aFinalOutput

FUNCTION FeedForward(aInput)
    LOCAL nBatchSize := Len(aInput), nSeqLen := Len(aInput[1]), nModelDim := Len(aInput[1, 1])
    LOCAL nFfDim := nModelDim * 4 // Typically the inner dimension is 4 times the model dimension
    
    LOCAL aW1 := GenerateRandomMatrix(nModelDim, nFfDim)
    LOCAL aW2 := GenerateRandomMatrix(nFfDim, nModelDim)
    
    LOCAL aHidden := LinearTransformation(aInput, aW1), aOutput
    aHidden := ApplyReLU(aHidden)
    aOutput := LinearTransformation(aHidden, aW2)

RETURN aOutput

FUNCTION AddAndNorm(aInput, aResidual)
    LOCAL aSum := AddMatrices(aInput, aResidual)
    LOCAL aNormalized := LayerNorm(aSum)
RETURN aNormalized

FUNCTION LayerNorm(aInput)
    LOCAL nBatchSize := Len(aInput), nSeqLen := Len(aInput[1]), nModelDim := Len(aInput[1, 1])
    LOCAL aOutput := Array(nBatchSize, nSeqLen, nModelDim)
    LOCAL i, j, k, nMean, nVariance, nEpsilon := (1 * 10^-5)
    
    FOR i := 1 TO nBatchSize
        FOR j := 1 TO nSeqLen
            nMean := CalcMean(aInput[i, j])
            nVariance := CalcVariance(aInput[i, j], nMean)
            
            FOR k := 1 TO nModelDim
                aOutput[i, j, k] := (aInput[i, j, k] - nMean) / Sqrt(nVariance + nEpsilon)
            NEXT
        NEXT
    NEXT

RETURN aOutput

FUNCTION GeneratePositionalEncoding(nSeqLen, nModelDim)
    LOCAL aEncoding := Array(nSeqLen, nModelDim)
    LOCAL i, j, nPos, nI
    
    FOR i := 1 TO nSeqLen
        FOR j := 1 TO nModelDim
            nPos := i - 1
            nI := j - 1
            IF nI % 2 == 0
                aEncoding[i, j] := Sin(nPos / (10000 ** (nI / nModelDim)))
            ELSE
                aEncoding[i, j] := Cos(nPos / (10000 ** ((nI - 1) / nModelDim)))
            ENDIF
        NEXT
    NEXT

RETURN aEncoding

FUNCTION AddPositionalEncoding(aInput, aPositionalEncoding)
    LOCAL nBatchSize := Len(aInput), nSeqLen := Len(aInput[1]), nModelDim := Len(aInput[1, 1])
    LOCAL aOutput := Array(nBatchSize, nSeqLen, nModelDim)
    LOCAL i, j, k
    
    FOR i := 1 TO nBatchSize
        FOR j := 1 TO nSeqLen
            FOR k := 1 TO nModelDim
                aOutput[i, j, k] := aInput[i, j, k] + aPositionalEncoding[j, k]
            NEXT
        NEXT
    NEXT

RETURN aOutput

FUNCTION LinearTransformation(aX, aW)
    LOCAL aResult, i, j, k, l, nSum  // l is the inner loop index and must be declared too
    LOCAL nBatchSize := Len(aX), nSeqLen := Len(aX[1])
    LOCAL nInDim := Len(aX[1, 1]), nOutDim := Len(aW[1])
    
    aResult := Array(nBatchSize, nSeqLen, nOutDim)
    FOR i := 1 TO nBatchSize
        FOR j := 1 TO nSeqLen
            FOR k := 1 TO nOutDim
                nSum := 0
                FOR l := 1 TO nInDim
                    nSum += aX[i, j, l] * aW[l, k]
                NEXT
                aResult[i, j, k] := nSum
            NEXT
        NEXT
    NEXT

RETURN aResult

FUNCTION CalculateAttentionScores(aQ, aK)
    LOCAL aScores, i, j, k, l, nSum
    LOCAL nBatchSize := Len(aQ), nSeqLen := Len(aQ[1]), nDimK := Len(aQ[1, 1])
    
    aScores := Array(nBatchSize, nSeqLen, nSeqLen)
    FOR i := 1 TO nBatchSize
        FOR j := 1 TO nSeqLen
            FOR k := 1 TO nSeqLen
                nSum := 0
                FOR l := 1 TO nDimK
                    nSum += aQ[i, j, l] * aK[i, k, l]
                NEXT
                aScores[i, j, k] := nSum / Sqrt(nDimK)
            NEXT
        NEXT
    NEXT
    
    aScores := ApplySoftmax(aScores)

RETURN aScores

FUNCTION ApplyAttention(aScores, aV)
    LOCAL aOutput, i, j, k, l, nSum
    LOCAL nBatchSize := Len(aScores), nSeqLen := Len(aScores[1]), nDimV := Len(aV[1, 1])
    
    aOutput := Array(nBatchSize, nSeqLen, nDimV)
    FOR i := 1 TO nBatchSize
        FOR j := 1 TO nSeqLen
            FOR k := 1 TO nDimV
                nSum := 0
                FOR l := 1 TO nSeqLen
                    nSum += aScores[i, j, l] * aV[i, l, k]
                NEXT
                aOutput[i, j, k] := nSum
            NEXT
        NEXT
    NEXT

RETURN aOutput

FUNCTION ConcatenateOutputs(aOutputs)
    LOCAL nBatchSize := Len(aOutputs[1]), nSeqLen := Len(aOutputs[1, 1])
    LOCAL nTotalDim := 0, nHeadDim, nHeads := Len(aOutputs)
    LOCAL aResult, i, j, k, l, nIndex
    
    nHeadDim := Len(aOutputs[1, 1, 1])
    nTotalDim := nHeadDim * nHeads
    
    aResult := Array(nBatchSize, nSeqLen, nTotalDim)
    FOR i := 1 TO nBatchSize
        FOR j := 1 TO nSeqLen
            nIndex := 1
            FOR k := 1 TO nHeads
                FOR l := 1 TO nHeadDim
                    aResult[i, j, nIndex] := aOutputs[k, i, j, l]
                    nIndex++
                NEXT
            NEXT
        NEXT
    NEXT

RETURN aResult

FUNCTION ApplyReLU(aInput)
    LOCAL aOutput := AClone(aInput)
    LOCAL i, j, k
    
    FOR i := 1 TO Len(aOutput)
        FOR j := 1 TO Len(aOutput[i])
            FOR k := 1 TO Len(aOutput[i, j])
                aOutput[i, j, k] := Max(0, aOutput[i, j, k])
            NEXT
        NEXT
    NEXT

RETURN aOutput

FUNCTION ApplySoftmax(aInput)
    LOCAL aOutput := AClone(aInput)
    LOCAL i, j, k, nMax, nSum, nBatchSize := Len(aInput), nSeqLen := Len(aInput[1])
    
    FOR i := 1 TO nBatchSize
        FOR j := 1 TO nSeqLen
            nMax := MaxInArray(aOutput[i, j])
            nSum := 0
            FOR k := 1 TO nSeqLen
                aOutput[i, j, k] := Exp(aOutput[i, j, k] - nMax)
                nSum += aOutput[i, j, k]
            NEXT
            FOR k := 1 TO nSeqLen
                aOutput[i, j, k] /= nSum
            NEXT
        NEXT
    NEXT

RETURN aOutput

FUNCTION AddMatrices(aA, aB)
    LOCAL aResult := AClone(aA)
    LOCAL i, j, k
    
    FOR i := 1 TO Len(aA)
        FOR j := 1 TO Len(aA[i])
            FOR k := 1 TO Len(aA[i, j])
                aResult[i, j, k] += aB[i, j, k]
            NEXT
        NEXT
    NEXT

RETURN aResult

FUNCTION GenerateRandomMatrix(nDim1, nDim2, nDim3)
    LOCAL aMatrix, i, j, k
    
    IF nDim3 == NIL
        aMatrix := Array(nDim1, nDim2)
        FOR i := 1 TO nDim1
            FOR j := 1 TO nDim2
                aMatrix[i, j] := hb_Random(0, 0.02)
            NEXT
        NEXT
    ELSE
        aMatrix := Array(nDim1, nDim2, nDim3)
        FOR i := 1 TO nDim1
            FOR j := 1 TO nDim2
                FOR k := 1 TO nDim3
                    aMatrix[i, j, k] := hb_Random(0, 0.02)
                NEXT
            NEXT
        NEXT
    ENDIF

RETURN aMatrix

FUNCTION CalcMean(aArray)
    LOCAL nSum := 0, i
    
    FOR i := 1 TO Len(aArray)
        nSum += aArray[i]
    NEXT

RETURN nSum / Len(aArray)

FUNCTION CalcVariance(aArray, nMean)
    LOCAL nSum := 0, i
    
    FOR i := 1 TO Len(aArray)
        nSum += (aArray[i] - nMean) ** 2
    NEXT

RETURN nSum / Len(aArray)

FUNCTION MaxInArray(aArray)
    LOCAL nMax := aArray[1], i
    
    FOR i := 2 TO Len(aArray)
        IF aArray[i] > nMax
            nMax := aArray[i]
        ENDIF
    NEXT

RETURN nMax

FUNCTION PrintMatrix(aMatrix)
    LOCAL i, j, k
    
    FOR i := 1 TO Len(aMatrix)
        ? "Batch", i
        FOR j := 1 TO Len(aMatrix[i])
            ?? "  Seq", j, ":"
            FOR k := 1 TO Len(aMatrix[i, j])
                ?? Round(aMatrix[i, j, k], 4), " "
            NEXT
            ?
        NEXT
        ?
    NEXT

RETURN NIL

Input con codificación posicional:
Batch 1 Seq 1 : 0.0014 1.0197 0.0004 1.0142 0.0178 1.0140 0.0108 1.0132
Seq 2 : 0.8541 0.5441 0.1168 1.0027 0.0280 1.0131 0.0070 1.0059
Seq 3 : 0.9097 -0.4140 0.2128 0.9911 0.0317 1.0035 0.0173 1.0191
Seq 4 : 0.1433 -0.9840 0.3029 0.9676 0.0489 1.0189 0.0213 1.0060
Seq 5 : -0.7532 -0.6397 0.3924 0.9272 0.0599 1.0040 0.0056 1.0126

Batch 2 Seq 1 : 0.0183 1.0073 0.0185 1.0091 0.0085 1.0046 0.0082 1.0038
Seq 2 : 0.8497 0.5552 0.1064 0.9975 0.0107 1.0165 0.0106 1.0086
Seq 3 : 0.9163 -0.4070 0.1992 0.9964 0.0240 1.0068 0.0071 1.0143
Seq 4 : 0.1552 -0.9761 0.3045 0.9570 0.0360 1.0150 0.0064 1.0168
Seq 5 : -0.7411 -0.6356 0.4068 0.9296 0.0404 1.0129 0.0205 1.0003
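
The printed input can be checked against the sinusoidal formula that GeneratePositionalEncoding follows. Here is a sketch in LaTeX, where pos corresponds to nPos (0-based position), 2i is the even dimension index (nI for even columns, nI - 1 for odd ones), and d_model is nModelDim = 8; the symbol names are my own shorthand. The hand check uses "Seq 2" (pos = 1) and is approximate.

```latex
% Sinusoidal positional encoding
\[
PE_{(pos,\,2i)} = \sin\!\left(\frac{pos}{10000^{2i/d_{\mathrm{model}}}}\right), \qquad
PE_{(pos,\,2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d_{\mathrm{model}}}}\right)
\]

% Hand check for pos = 1, d_model = 8 (first four columns)
\[
\sin(1) \approx 0.8415, \quad \cos(1) \approx 0.5403, \quad
\sin\!\left(\frac{1}{10000^{2/8}}\right) = \sin(0.1) \approx 0.0998, \quad
\cos(0.1) \approx 0.9950
\]
```

The printed values for that row (0.8541, 0.5441, 0.1168, 1.0027 in Batch 1) are each larger by an amount between 0 and 0.02, which is consistent with the random input drawn by hb_Random(0, 0.02) being added to the encoding.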

Salida del Encoder:
Batch 1 Seq 1 : -1.0144 1.0087 -1.0139 0.9987 -0.9794 0.9981 -0.9920 0.9942
Seq 2 : 0.6568 -0.0642 -1.0607 1.0069 -1.2676 1.0320 -1.3151 1.0120
Seq 3 : 0.8163 -1.6535 -0.4832 0.9712 -0.8206 0.9948 -0.8464 1.0215
Seq 4 : -0.2712 -2.0317 -0.0200 1.0203 -0.4167 1.1004 -0.4590 1.0780
Seq 5 : -1.5150 -1.3417 0.2130 1.0197 -0.2879 1.1349 -0.3686 1.1456

Batch 2 Seq 1 : -0.9924 1.0021 -0.9892 1.0067 -1.0095 0.9976 -1.0088 0.9934
Seq 2 : 0.6464 -0.0333 -1.0720 0.9920 -1.2932 1.0368 -1.2921 1.0154
Seq 3 : 0.8283 -1.6297 -0.5029 0.9799 -0.8282 0.9997 -0.8583 1.0112
Seq 4 : -0.2511 -2.0214 -0.0156 1.0076 -0.4357 1.0983 -0.4811 1.0991
Seq 5 : -1.5075 -1.3457 0.2309 1.0229 -0.3233 1.1482 -0.3523 1.1269
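
Since every encoder layer ends with AddAndNorm, each row of the final output should be normalized. Here is a sketch of what AddAndNorm and LayerNorm compute, in LaTeX, where x is one row of length d = nModelDim and epsilon is nEpsilon = 10^-5; the symbols are my own shorthand. Note that this LayerNorm has no learned scale or shift parameters, unlike the version usually used in trained models.

```latex
% Residual connection followed by layer normalization (AddAndNorm)
\[
y = \mathrm{LayerNorm}\bigl(x + \mathrm{Sublayer}(x)\bigr)
\]

% Per-row statistics (CalcMean, CalcVariance) and normalization (LayerNorm)
\[
\mu = \frac{1}{d}\sum_{k=1}^{d} x_k, \qquad
\sigma^2 = \frac{1}{d}\sum_{k=1}^{d} (x_k - \mu)^2, \qquad
\mathrm{LayerNorm}(x)_k = \frac{x_k - \mu}{\sqrt{\sigma^2 + \epsilon}}
\]

% Hand check on "Batch 1 Seq 1" of the encoder output (d = 8)
\[
\frac{1}{8}\sum_{k=1}^{8} y_k \approx 0.0000, \qquad
\frac{1}{8}\sum_{k=1}^{8} y_k^2 \approx 1.0000
\]
```

So the first printed row has mean close to 0 and variance close to 1, as the formula predicts.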

Post by **Antonio Linares** » Sun Jul 07, 2024 6:46 am

https://github.com/FiveTechSoft/transformer
