Description

Book Synopsis

Transformers are becoming a core part of many neural network architectures, employed in a wide range of applications such as NLP, Speech Recognition, Time Series, and Computer Vision. Transformers have gone through many adaptations and alterations, resulting in newer techniques and methods. Transformers for Machine Learning: A Deep Dive is the first comprehensive book on transformers.

Key Features:

  • A comprehensive reference book with detailed explanations of every algorithm and technique related to transformers.
  • 60+ transformer architectures covered in a comprehensive manner.
  • A book for understanding how to apply the transformer techniques in speech, text, time series, and computer vision.
  • Practical tips and tricks for each architecture and how to use it in the real world.
  • Hands-on case studies and code snippets for theory and practical real-world analysis using the tools and libraries, all ready to run in Goog

    Table of Contents

    List of Figures
    List of Tables
    Author Bios
    Foreword
    Preface
    Contributors

    1 Deep Learning and Transformers: An Introduction
    1.1 DEEP LEARNING: A HISTORIC PERSPECTIVE
    1.2 TRANSFORMERS AND TAXONOMY
    1.2.1 Modified Transformer Architecture
    1.2.1.1 Transformer block changes
    1.2.1.2 Transformer sublayer changes
    1.2.2 Pretraining Methods and Applications
    1.3 RESOURCES
    1.3.1 Libraries and Implementations
    1.3.2 Books
    1.3.3 Courses, Tutorials, and Lectures
    1.3.4 Case Studies and Details

    2 Transformers: Basics and Introduction
    2.1 ENCODER-DECODER ARCHITECTURE
    2.2 SEQUENCE TO SEQUENCE
    2.2.1 Encoder
    2.2.2 Decoder
    2.2.3 Training
    2.2.4 Issues with RNN-based Encoder Decoder
    2.3 ATTENTION MECHANISM
    2.3.1 Background
    2.3.2 Types of Score-Based Attention
    2.3.2.1 Dot Product (multiplicative)
    2.3.2.2 Scaled Dot Product or multiplicative
    2.3.2.3 Linear, MLP, or additive
    2.3.3 Attention-based Sequence to Sequence
    2.4 TRANSFORMER
    2.4.1 Source and Target Representation
    2.4.1.1 Word Embedding
    2.4.1.2 Positional Encoding
    2.4.2 Attention Layers
    2.4.2.1 Self-Attention
    2.4.2.2 Multi-Head Attention
    2.4.2.3 Masked Multi-Head Attention
    2.4.2.4 Encoder-Decoder Multi-Head Attention
    2.4.3 Residuals and Layer Normalization
    2.4.4 Position-wise Feed-Forward Networks
    2.4.5 Encoder
    2.4.6 Decoder
    2.5 CASE STUDY: MACHINE TRANSLATION
    2.5.1 Goal
    2.5.2 Data, Tools and Libraries
    2.5.3 Experiments, Results and Analysis
    2.5.3.1 Exploratory Data Analysis
    2.5.3.2 Attention
    2.5.3.3 Transformer
    2.5.3.4 Results and Analysis
    2.5.3.5 Explainability

    3 Bidirectional Encoder Representations from Transformers (BERT)
    3.1 BERT
    3.1.1 Architecture
    3.1.2 Pre-training
    3.1.3 Fine-tuning
    3.2 BERT VARIANTS
    3.2.1 RoBERTa
    3.3 APPLICATIONS
    3.3.1 TaBERT
    3.3.2 BERTopic
    3.4 BERT INSIGHTS
    3.4.1 BERT Sentence Representation
    3.4.2 BERTology
    3.5 CASE STUDY: TOPIC MODELING WITH TRANSFORMERS
    3.5.1 Goal
    3.5.2 Data, Tools, and Libraries
    3.5.2.1 Data
    3.5.2.2 Compute embeddings
    3.5.3 Experiments, Results, and Analysis
    3.5.3.1 Building Topics
    3.5.3.2 Topic size distribution
    3.5.3.3 Visualization of topics
    3.5.3.4 Content of topics
    3.6 CASE STUDY: FINE-TUNING BERT
    3.6.1 Goal
    3.6.2 Data, Tools and Libraries
    3.6.3 Experiments, Results and Analysis

    4 Multilingual Transformer Architectures
    4.1 MULTILINGUAL TRANSFORMER ARCHITECTURES
    4.1.1 Basic Multilingual Transformer
    4.1.2 Single-Encoder Multilingual NLU
    4.1.2.1 mBERT
    4.1.2.2 XLM
    4.1.2.3 XLM-RoBERTa
    4.1.2.4 ALM
    4.1.2.5 Unicoder
    4.1.2.6 INFOXLM
    4.1.2.7 AMBER
    4.1.2.8 ERNIE-M
    4.1.2.9 HICTL
    4.1.3 Dual-Encoder Multilingual NLU
    4.1.3.1 LaBSE
    4.1.3.2 mUSE
    4.1.4 Multilingual NLG
    4.2 MULTILINGUAL DATA
    4.2.1 Pre-training Data
    4.2.2 Multilingual Benchmarks
    4.2.2.1 Classification
    4.2.2.2 Structure Prediction
    4.2.2.3 Question Answering
    4.2.2.4 Semantic Retrieval
    4.3 MULTILINGUAL TRANSFER LEARNING INSIGHTS
    4.3.1 Zero-shot Cross-lingual Learning
    4.3.1.1 Data Factors
    4.3.1.2 Model Architecture Factors
    4.3.1.3 Model Tasks Factors
    4.3.2 Language-agnostic Cross-lingual Representations
    4.4 CASE STUDY
    4.4.1 Goal
    4.4.2 Data, Tools, and Libraries
    4.4.3 Experiments, Results, and Analysis
    4.4.3.1 Data Preprocessing
    4.4.3.2 Experiments

    5 Transformer Modifications
    5.1 TRANSFORMER BLOCK MODIFICATIONS
    5.1.1 Lightweight Transformers
    5.1.1.1 Funnel-Transformer
    5.1.1.2 DeLighT
    5.1.2 Connections between Transformer Blocks
    5.1.2.1 RealFormer
    5.1.3 Adaptive Computation Time
    5.1.3.1 Universal Transformers (UT)
    5.1.4 Recurrence Relations between Transformer Blocks
    5.1.4.1 Transformer-XL
    5.1.5 Hierarchical Transformers
    5.2 TRANSFORMERS WITH MODIFIED MULTI-HEAD SELF-ATTENTION
    5.2.1 Structure of Multi-head Self-Attention
    5.2.1.1 Multi-head self-attention
    5.2.1.2 Space and time complexity
    5.2.2 Reducing Complexity of Self-attention
    5.2.2.1 Longformer
    5.2.2.2 Reformer
    5.2.2.3 Performer
    5.2.2.4 Big Bird
    5.2.3 Improving Multi-head-attention
    5.2.3.1 Talking-Heads Attention
    5.2.4 Biasing Attention with Priors
    5.2.5 Prototype Queries
    5.2.5.1 Clustered Attention
    5.2.6 Compressed Key-Value Memory
    5.2.6.1 Luna: Linear Unified Nested Attention
    5.2.7 Low-rank Approximations
    5.2.7.1 Linformer
    5.3 MODIFICATIONS FOR TRAINING TASK EFFICIENCY
    5.3.1 ELECTRA
    5.3.1.1 Replaced token detection
    5.3.2 T5
    5.4 TRANSFORMER SUBMODULE CHANGES
    5.4.1 Switch Transformer
    5.5 CASE STUDY: SENTIMENT ANALYSIS
    5.5.1 Goal
    5.5.2 Data, Tools, and Libraries
    5.5.3 Experiments, Results, and Analysis
    5.5.3.1 Visualizing attention head weights
    5.5.3.2 Analysis

    6 Pretrained and Application-Specific Transformers
    6.1 TEXT PROCESSING
    6.1.1 Domain-Specific Transformers
    6.1.1.1 BioBERT
    6.1.1.2 SciBERT
    6.1.1.3 FinBERT
    6.1.2 Text-to-text Transformers
    6.1.2.1 ByT5
    6.1.3 Text generation
    6.1.3.1 GPT: Generative Pre-training
    6.1.3.2 GPT-2
    6.1.3.3 GPT-3
    6.2 COMPUTER VISION
    6.2.1 Vision Transformer
    6.3 AUTOMATIC SPEECH RECOGNITION
    6.3.1 Wav2vec 2.0
    6.3.2 Speech2Text2
    6.3.3 HuBERT: Hidden Units BERT
    6.4 MULTIMODAL AND MULTITASKING TRANSFORMER
    6.4.1 Vision-and-Language BERT (ViLBERT)
    6.4.2 Unified Transformer (UniT)
    6.5 VIDEO PROCESSING WITH TIMESFORMER
    6.5.1 Patch embeddings
    6.5.2 Self-attention
    6.5.2.1 Spatiotemporal self-attention
    6.5.2.2 Spatiotemporal attention blocks
    6.6 GRAPH TRANSFORMERS
    6.6.1 Positional encodings in a graph
    6.6.1.1 Laplacian positional encodings
    6.6.2 Graph transformer input
    6.6.2.1 Graphs without edge attributes
    6.6.2.2 Graphs with edge attributes
    6.7 REINFORCEMENT LEARNING
    6.7.1 Decision Transformer
    6.8 CASE STUDY: AUTOMATIC SPEECH RECOGNITION
    6.8.1 Goal
    6.8.2 Data, Tools, and Libraries
    6.8.3 Experiments, Results, and Analysis
    6.8.3.1 Preprocessing speech data
    6.8.3.2 Evaluation

    7 Interpretability and Explainability Techniques for Transformers
    7.1 TRAITS OF EXPLAINABLE SYSTEMS
    7.2 RELATED AREAS THAT IMPACT EXPLAINABILITY
    7.3 EXPLAINABLE METHODS TAXONOMY
    7.3.1 Visualization Methods
    7.3.1.1 Backpropagation-based
    7.3.1.2 Perturbation-based
    7.3.2 Model Distillation
    7.3.2.1 Local Approximation
    7.3.2.2 Model Translation
    7.3.3 Intrinsic Methods
    7.3.3.1 Probing Mechanism
    7.3.3.2 Joint Training
    7.4 ATTENTION AND EXPLANATION
    7.4.1 Attention is not Explanation
    7.4.1.1 Attention Weights and Feature Importance
    7.4.1.2 Counterfactual Experiments
    7.4.2 Attention is not not Explanation
    7.4.2.1 Is attention necessary for all tasks?
    7.4.2.2 Searching for Adversarial Models
    7.4.2.3 Attention Probing
    7.5 QUANTIFYING ATTENTION FLOW
    7.5.1 Information flow as DAG
    7.5.2 Attention Rollout
    7.5.3 Attention Flow
    7.6 CASE STUDY: TEXT CLASSIFICATION WITH EXPLAINABILITY
    7.6.1 Goal
    7.6.2 Data, Tools, and Libraries
    7.6.3 Experiments, Results and Analysis
    7.6.3.1 Exploratory Data Analysis
    7.6.3.2 Experiments
    7.6.3.3 Error Analysis and Explainability

    Bibliography
    Alphabetical Index

Transformers for Machine Learning

£42.74

Includes FREE delivery

RRP £44.99 – you save £2.25 (5%)

A Paperback by Uday Kamath, Kenneth Graham, Wael Emara


    Publisher: CRC Press
    Publication Date: 25 May 2022
    ISBN13: 978-0367767341
    ISBN10: 0367767341
