39.3 Attention Mechanism in Seq2Seq Models
Right, so here we are. You’ve built your first sequence-to-sequence model, probably an LSTM-based encoder-decoder. It works. Sort of. You feed it a sentence, it encodes the whole thing into a single, fixed-size vector (a “thought vector,” if you will), and then the decoder has to spit out a perfect translation based solely on that compressed memory. And you quickly realize this is a bit of a nightmare. That one vector becomes an information bottleneck: it is the same size whether the input has five words or fifty. The poor decoder, especially when dealing with long sequences, is like a student trying to write a perfect essay based on a single, crammed note they wrote two weeks ago. By the time the encoder reaches the end of a long sentence, much of what it read at the beginning has been squeezed out of that vector. The translations for long inputs become vague, generic, and frankly, a bit useless.
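To make the bottleneck concrete, here is a minimal sketch in plain Python. It is not the LSTM encoder described above: it assumes a bare tanh recurrence with random, untrained weights, and the names `make_encoder`, `encode`, and `random_sentence` are made up for illustration. The only point is the shape of what comes out: a 5-step input and a 500-step input both get squeezed into the same 8 numbers.

```python
import math
import random


def make_encoder(input_size, hidden_size, seed=0):
    """Build a toy recurrent encoder with fixed random (untrained) weights."""
    rng = random.Random(seed)

    def rand_matrix(rows, cols):
        return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

    W_xh = rand_matrix(hidden_size, input_size)   # input -> hidden
    W_hh = rand_matrix(hidden_size, hidden_size)  # hidden -> hidden

    def encode(sequence):
        # The hidden state is overwritten at every step; only the last one survives.
        h = [0.0] * hidden_size
        for x in sequence:
            h = [
                math.tanh(
                    sum(W_xh[i][j] * x[j] for j in range(input_size))
                    + sum(W_hh[i][k] * h[k] for k in range(hidden_size))
                )
                for i in range(hidden_size)
            ]
        return h  # the single fixed-size "thought vector"

    return encode


def random_sentence(length, input_size, seed):
    """Stand-in for a sequence of word embeddings (random values, not real data)."""
    rng = random.Random(seed)
    return [[rng.uniform(-1.0, 1.0) for _ in range(input_size)] for _ in range(length)]


encode = make_encoder(input_size=4, hidden_size=8)
short_context = encode(random_sentence(5, 4, seed=1))
long_context = encode(random_sentence(500, 4, seed=2))

print(len(short_context), len(long_context))  # prints "8 8"
```

Whatever the decoder does next, it starts from those 8 values and nothing else. A real model would use a much wider hidden state and learned weights, but the capacity is still fixed up front while the input length is not.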