Description

Book Synopsis

The only book to offer special coverage of the fundamentals of multicore DSP for implementation on the TMS320C66xx SoC

This unique book provides readers with an understanding of the TMS320C66xx SoC as well as its constraints. It offers critical analysis of each element, which not only broadens their knowledge of the subject, but aids them in gaining a better understanding of how these elements work so well together.

Written by Texas Instruments' First DSP Educator Award winner, Naim Dahnoun, the book teaches readers how to use the development tools, take advantage of the maximum performance and functionality of this processor and have an understanding of the rich content which spans from architecture, development tools and programming models, such as OpenCL and OpenMP, to debugging tools. It also covers various multicore audio and image applications in detail. Additionally, this one-of-a-kind book is supplemented with:

  • A rich set of tested laborator

    Table of Contents

    Preface xviii

    Acknowledgements xxi

    Foreword xxii

    About the Companion Website xxiii

    1 Introduction to DSP 1

    1.1 Introduction 1

    1.2 Multicore processors 3

    1.2.1 Can any algorithm benefit from a multicore processor? 3

    1.2.2 How many cores do I need for my application? 5

    1.3 Key applications of high-performance multicore devices 6

    1.4 FPGAs, Multicore DSPs, GPUs and Multicore CPUs 8

    1.5 Challenges faced for programming a multicore processor 9

    1.6 Texas Instruments DSP roadmap 10

    1.7 Conclusion 11

    References 12

    2 The TMS320C66x architecture overview 14

    2.1 Overview 14

    2.2 The CPU 15

    2.2.1 Cross paths 16

    2.2.1.1 Data cross paths 17

    2.2.1.2 Address cross paths 18

    2.2.2 Register file A and file B 20

    2.2.2.1 Operands 20

    2.2.3 Functional units 21

    2.2.3.1 Condition registers 21

    2.2.3.2 .L units 22

    2.2.3.3 .M units 22

    2.2.3.4 .S units 23

    2.2.3.5 .D units 23

    2.3 Single instruction, multiple data (SIMD) instructions 24

    2.3.1 Control registers 24

    2.4 The KeyStone memory 24

    2.4.1 Using the internal memory 27

    2.4.2 Memory protection and extension 29

    2.4.3 Memory throughput 29

    2.5 Peripherals 30

    2.5.1 Navigator 32

    2.5.2 Enhanced Direct Memory Access (EDMA) Controller 32

    2.5.3 Universal Asynchronous Receiver/Transmitter (UART) 32

    2.5.4 General purpose input–output (GPIO) 32

    2.5.5 Internal timers 32

    2.6 Conclusion 33

    References 33

    3 Software development tools and the TMS320C6678 EVM 35

    3.1 Introduction 35

    3.2 Software development tools 37

    3.2.1 Compiler 38

    3.2.2 Assembler 39

    3.2.3 Linker 40

    3.2.3.1 Linker command file 40

    3.2.4 Compile, assemble and link 42

    3.2.5 Using the Real-Time Software Components (RTSC) tools 42

    3.2.5.1 Platform update using the XDCtools 42

    3.2.6 KeyStone Multicore Software Development Kit 47

    3.3 Hardware development tools 47

    3.3.1 EVM features 47

    3.4 Laboratory experiments based on the C6678 EVM: introduction to Code Composer Studio (CCS) 51

    3.4.1 Software and hardware requirements 51

    3.4.1.1 Key features 52

    3.4.1.2 Download sites 53

    3.4.2 Laboratory experiments with the CCS6 53

    3.4.2.1 Introduction to CCS 55

    3.4.2.2 Implementation of a DOTP algorithm 63

    3.4.3 Profiling using the clock 65

    3.4.4 Considerations when measuring time 67

    3.5 Loading different applications to different cores 67

    3.6 Conclusion 72

    References 72

    4 Numerical issues 74

    4.1 Introduction 74

    4.2 Fixed- and floating-point representations 75

    4.2.1 Fixed-point arithmetic 76

    4.2.1.1 Unsigned integer 76

    4.2.1.2 Signed integer 77

    4.2.1.3 Fractional numbers 77

    4.2.2 Floating-point arithmetic 78

    4.2.2.1 Special numbers for the 32-bit and 64-bit floating-point formats 81

    4.3 Dynamic range and accuracy 82

    4.4 Laboratory exercise 83

    4.5 Conclusion 85

    References 85

    5 Software optimisation 86

    5.1 Introduction 86

    5.2 Hindrance to software scalability for a multicore processor 88

    5.3 Single-core code optimisation procedure 88

    5.3.1 The C compiler options 90

    5.4 Interfacing C with intrinsics, linear assembly and assembly 91

    5.4.1 Intrinsics 91

    5.4.2 Interfacing C and assembly 92

    5.5 Assembly optimisation 97

    5.5.1 Parallel instructions 98

    5.5.2 Removing the NOPs 99

    5.5.3 Loop unrolling 99

    5.5.4 Double-Word Access 100

    5.5.5 Optimisation summary 100

    5.6 Software pipelining 101

    5.6.1 Software-pipelining procedure 105

    5.6.1.1 Writing linear assembly code 105

    5.6.1.2 Creating a dependency graph 105

    5.6.1.3 Resource allocation 108

    5.6.1.4 Scheduling table 108

    5.6.1.5 Generating assembly code 109

    5.7 Linear assembly 111

    5.7.1 Hand optimisation of the dotp function using linear assembly 112

    5.8 Avoiding memory banks 118

    5.9 Optimisation using the tools 118

    5.10 Laboratory experiments 123

    5.11 Conclusion 126

    References 126

    6 The TMS320C66x interrupts 127

    6.1 Introduction 127

    6.1.1 Chip-level interrupt controller 129

    6.2 The interrupt controller 135

    6.3 Laboratory experiment 140

    6.3.1 Experiment 1: Using the GIPIOs to trigger some functions 140

    6.3.2 Experiment 2: Using the console to trigger an interrupt 140

    6.4 Conclusion 143

    References 144

    7 Real-time operating system: TI-RTOS 145

    7.1 Introduction 146

    7.2 TI-RTOS 146

    7.3 Real-time scheduling 148

    7.3.1 Hardware interrupts (Hwis) 148

    7.3.1.1 Setting an Hwi 149

    7.3.1.2 Hwi hook functions 149

    7.3.2 Software interrupts (Swis), including clock, periodic or single-shot functions 155

    7.3.3 Tasks 155

    7.3.3.1 Task hook functions 157

    7.3.4 Idle functions 158

    7.3.5 Clock functions 158

    7.3.6 Timer functions 158

    7.3.7 Synchronisation 158

    7.3.7.1 Semaphores 159

    7.3.7.2 Semaphore_pend 159

    7.3.7.3 Semaphore_post 159

    7.3.7.4 How to configure the semaphores 159

    7.3.8 Events 159

    7.3.9 Summary 163

    7.4 Dynamic memory management 163

    7.4.1 Stack allocation 165

    7.4.2 Heap allocation 165

    7.4.3 Heap implementation 165

    7.4.3.1 HeapMin implementation 165

    7.4.3.2 HeapMem implementation 165

    7.4.3.3 HeapBuf implementation 167

    7.4.3.4 HeapMultiBuf implementation 171

    7.5 Laboratory experiments 172

    7.5.1 Lab 1: Manual setup of the clock (part 1) 172

    7.5.2 Lab 2: Manual setup of the clock (part 2) 172

    7.5.3 Lab 3: Using Hwis, Swis, tasks and clocks 174

    7.5.4 Lab 4: Using events 187

    7.5.5 Lab 5: Using the heaps 189

    7.6 Conclusion 190

    References 191

    References (further reading) 191

    8 Enhanced Direct Memory Access (EDMA3) controller 192

    8.1 Introduction 192

    8.2 Type of DMAs available 193

    8.3 EDMA controllers architecture 194

    8.3.1 The EDMA3 Channel Controller (EDMA3CC) 194

    8.3.2 The EDMA3 transfer controller (EDMA3TC) 201

    8.3.3 EDMA prioritisation 201

    8.3.3.1 Trigger source priority 202

    8.3.3.2 Channel priority 203

    8.3.3.3 Dequeue priority 203

    8.3.3.4 System (transfer controller) priority 203

    8.4 Parameter RAM (PaRAM) 203

    8.4.1 Channel options parameter (OPT) 203

    8.5 Transfer synchronisation dimensions 203

    8.5.1 A – Synchronisation 204

    8.5.2 AB – Synchronisation 204

    8.6 Simple EDMA transfer 204

    8.7 Chaining EDMA transfers 208

    8.8 Linked EDMAs 208

    8.9 Laboratory experiments 210

    8.9.1 Laboratory 1: Simple EDMA transfer 211

    8.9.2 Laboratory 2: EDMA chaining transfer 211

    8.9.3 Laboratory 3: EDMA link transfer 213

    8.10 Conclusion 213

    References 213

    9 Inter-Processor Communication (IPC) 214

    9.1 Introduction 215

    9.2 Texas Instruments IPC 217

    9.3 Notify module 219

    9.3.1 Laboratory experiment 222

    9.4 MessageQ 222

    9.4.1 MessageQ protocol 224

    9.4.2 Message priority 229

    9.4.3 Thread synchronisation 229

    9.5 ListMP module 233

    9.6 GateMP module 234

    9.6.1 Initialising a GateMP parameter structure 234

    9.6.1.1 Types of gate protection 235

    9.6.2 Creating a GateMP instance 236

    9.6.3 Entering a GateMP 236

    9.6.4 Leaving a gate 236

    9.6.5 The list of functions that can be used by GateMP 237

    9.7 Multi-processor Memory Allocation: HeapBufMP, HeapMemMP and HeapMultiBufMP 237

    9.7.1 HeapBuf_Params 238

    9.7.2 HeapMem_Params 239

    9.7.3 HeapMultiBuf_Params 239

    9.7.4 Configuration example for HeapMultiBuf 239

    9.8 Transport mechanisms for the IPC 241

    9.9 Laboratory experiments with KeyStone I 241

    9.9.1 Laboratory 1: Using MessageQ with multiple cores 241

    9.9.1.1 Overview 242

    9.9.2 Laboratory 2: Using ListMP, ShareRegion and GateMP 243

    9.10 Laboratory experiments with KeyStone II 249

    9.10.1 Laboratory experiment 1: Transferring a block of data 249

    9.10.1.1 Set the connection between the host (PC) and the KeyStone 249

    9.10.1.2 Explore the ARM code 250

    9.10.1.3 Explore the DSP code 259

    9.10.1.4 Compile and run the program 263

    9.10.2 Laboratory experiment 2: Transferring a pointer 267

    9.10.2.1 Explore the ARM code 267

    9.10.2.2 Explore the DSP code 271

    9.10.2.3 Compile and run the program 278

    9.11 Conclusion 278

    References 278

    10 Single and multicore debugging 280

    10.1 Introduction 281

    10.2 Software and hardware debugging 282

    10.3 Debug architecture 282

    10.3.1 Trace 282

    10.3.1.1 Standard trace 282

    10.3.1.2 Event trace 283

    10.3.1.3 System trace 285

    10.4 Advanced Event Triggering 286

    10.4.1 Advanced Event Triggering logic 289

    10.4.2 Unified Breakpoint Manager 294

    10.5 Unified Instrumentation Architecture 295

    10.5.1 Host-side tooling 295

    10.5.2 Target-side tooling 295

    10.5.2.1 Software instrumentation APIs 297

    10.5.2.2 Predefined software events and metadata 297

    10.5.2.3 Event loggers 297

    10.5.2.4 Transports 297

    10.5.2.5 SYS/BIOS event capture and transport 297

    10.5.2.6 Multicore support 297

    10.6 Debugging with the System Analyzer tools 298

    10.6.1 Target-side coding with UIA APIs and the XDCtools 299

    10.6.2 Logging events with Log_write() functions 300

    10.6.3 Advance debugging using the diagnostic feature 301

    10.6.4 LogSnapshot APIs for logging state information 302

    10.7 Instrumentation with TI-RTOS and CCS 302

    10.7.1 Using RTOS Object Viewer 302

    10.7.2 Using the RTOS Analyzer and the System Analyzer 303

    10.7.2.1 RTOS Analyzer 303

    10.7.2.2 System Analyzer 303

    10.8 Laboratory sessions 305

    10.8.1 Laboratory experiment 1: Using the RTOS ROV 305

    10.8.2 Laboratory experiment 2: Using the RTOS Analyzer 305

    10.8.3 Laboratory experiment 3: Using the System Analyzer 312

    10.8.4 Laboratory experiment 4: Using diagnosis features 314

    10.8.5 Laboratory experiment 5: Using a diagnostic feature with filtering 317

    10.9 Conclusion 321

    References 322

    Further reading 323

    11 Bootloader for KeyStone I and KeyStone II 324

    11.1 Introduction 324

    11.2 How to start the boot process 325

    11.3 The boot process 325

    11.4 ROM Bootloader (RBL) 328

    11.4.1 The boot configuration format 336

    11.4.1.1 Creating the boot parameter table 336

    11.4.1.2 Creating the boot table 338

    11.4.1.3 The boot configuration table 338

    11.5 Boot process 340

    11.5.1 Initialisation stage for the KeyStone I 340

    11.5.2 Second-level bootloader 341

    11.5.2.1 Intermediate bootloader 341

    11.5.2.2 How to use the IBL 342

    11.6 Laboratory experiment 1 345

    11.6.1 Initialisation stage for the KeyStone II 350

    11.6.1.1 Bootloader initialisation after power-on reset 350

    11.6.1.2 Bootloader initialisation process after hard or soft reset 350

    11.6.2 Second bootloader for the KeyStone II 350

    11.6.2.1 U-Boot 351

    11.7 Laboratory experiment 2 352

    11.7.1 Printing the U-Boot environment 360

    11.7.2 Using the help for U-Boot 362

    11.8 TFTP boot with a host-mounted Network File System (NFS) server – NFS booting 363

    11.8.1 Laboratory experiment 3 364

    11.9 Conclusion 372

    References 372

    12 Introduction to OpenMP 374

    12.1 Introduction to OpenMP 375

    12.2 Directive formats 376

    12.3 Forking region 377

    12.3.1 omp parallel – parallel region construct 377

    12.3.1.1 Clause descriptions 378

    12.4 Work-sharing constructs 382

    12.4.1 omp for 382

    12.4.1.1 OpenMP loop scheduling 383

    12.4.2 omp sections 385

    12.4.3 omp single 386

    12.4.4 omp master 386

    12.4.5 omp task 387

    12.5 Environment variables and library functions 390

    12.6 Synchronisation constructs 392

    12.6.1 atomic 393

    12.6.1.1 Clauses 393

    12.6.2 barrier 395

    12.6.3 critical 396

    12.7 OpenMP accelerator model 397

    12.7.1 Supported OpenMP device constructs 397

    12.7.1.1 #pragma omp target 397

    12.7.1.2 #pragma omp target data 399

    12.7.1.3 #pragma omp target update 400

    12.7.1.4 #pragma omp declare target 401

    12.8 Laboratory experiments 402

    12.8.1 Laboratory experiment 1 402

    12.8.2 Laboratory experiment 2 402

    12.8.3 Laboratory experiment 3 404

    12.8.4 Laboratory experiment 4 405

    12.8.5 Laboratory experiment 5 405

    12.9 Conclusion 417

    References 419

    13 Introduction to OpenCL for the KeyStone II 420

    13.1 Introduction 421

    13.2 Operation of OpenCL 421

    13.3 Command queue 424

    13.3.1 Creating a command queue 427

    13.3.1.1 Command-queue properties 429

    13.3.2 Enqueueing a kernel 430

    13.4 Kernel declaration 431

    13.5 How do the kernels access data? 431

    13.6 OpenCL memory model for the KeyStone 432

    13.6.1 Creating a buffer 433

    13.6.1.1 Cl_mem_flags 434

    13.7 Synchronisation 435

    13.7.1 Event with a callback function 436

    13.7.2 User event 439

    13.7.3 Waiting for one command or all commands to finish 439

    13.7.4 wait_group_events 440

    13.7.5 Barrier 440

    13.8 Basic debugging profiling 440

    13.9 OpenMP dispatch from OpenCL 443

    13.9.1 OpenMP for the kernel code 443

    13.9.2 OpenMP for the ARM code 443

    13.10 Building the OpenCL project 444

    13.11 Laboratory experiments 445

    13.11.1 Laboratory experiment 1: Hello World 446

    13.11.2 Laboratory experiment 2: dotp functions 454

    13.11.2.1 Explore the main.cpp function 454

    13.11.2.2 Explore the kernel dotp.cl 459

    13.11.2.3 Run the dotp program 460

    13.11.3 Laboratory experiment 3: USE_HOST_PTR 460

    13.11.4 Laboratory experiment 4: ALLOC_HOST_PTR 463

    13.11.5 Laboratory experiment 5: COPY_HOST_PTR 465

    13.11.6 Laboratory experiment 6: Synchronisation 467

    13.11.7 Laboratory experiment 7: Local buffer 473

    13.11.8 Laboratory experiment 8: Barrier 477

    13.11.9 Laboratory experiment 9: Profiling 479

    13.11.10 Laboratory experiment 10: OpenMP in kernel 484

    13.11.11 Laboratory experiment 11: OpenMP in ARM 487

    13.12 Conclusion 489

    References 490

    14 Multicore Navigator 491

    14.1 Introduction 491

    14.2 Navigator architecture 492

    14.2.1 The PKDMA 494

    14.2.1.1 PKDMA transmit side 495

    14.2.1.2 PKDMA receive side 495

    14.2.1.3 Infrastructure PKDMA 497

    14.2.2 Descriptors 497

    14.2.2.1 Host packet descriptors 498

    14.2.2.2 Monolithic packet descriptor 498

    14.2.2.3 Setting up the memory regions for the descriptors 498

    14.2.3 Queue Manager Subsystem 500

    14.2.4 Queue Manager 503

    14.2.4.1 Queue peek registers 503

    14.2.4.2 Link RAM 504

    14.2.5 Accumulator packet data structure processors 504

    14.2.5.1 Accumulation 506

    14.2.5.2 Quality of service 506

    14.2.5.3 Event management (resource sharing and job load balancing) 506

    14.2.6 Interrupt distributor module 506

    14.3 Complete functionality of the Navigator 506

    14.4 Laboratory experiment 511

    14.5 Conclusion 513

    References 514

    15 FIR filter implementation 515

    15.1 Introduction 515

    15.2 Properties of an FIR filter 516

    15.2.1 Filter coefficients 516

    15.2.2 Frequency response of an FIR filter 516

    15.2.3 Phase linearity of an FIR filter 517

    15.3 Design procedure 518

    15.3.1 Specifications 518

    15.3.2 Coefficients calculation 519

    15.3.2.1 Window method 519

    15.3.3 Realisation structure 522

    15.3.3.1 Direct structure 525

    15.3.3.2 Linear phase structures 525

    15.3.3.3 Cascade structures 527

    15.4 Laboratory experiments 528

    15.4.1 Filter implementation 529

    15.4.2 Synchronisation 530

    15.4.3 Building and running the DSP project 532

    15.4.4 Building and running the PC project 534

    15.5 Conclusion 540

    References 540

    16 IIR filter implementation 542

    16.1 Introduction 542

    16.2 Design procedure 543

    16.3 Coefficients calculation 543

    16.3.1 Pole–zero placement approach 543

    16.3.2 Analogue-to-digital filter design 543

    16.3.3 Bilinear transform (BZT) method 544

    16.3.3.1 Practical example of the bilinear transform method 547

    16.3.3.2 Coefficients calculation 547

    16.3.3.3 Realisation structures 548

    16.3.4 Impulse invariant method 552

    16.3.4.1 Practical example of the impulse invariant method 553

    16.4 IIR filter implementation 556

    16.5 Laboratory experiment 561

    16.6 Conclusion 561

    Reference 562

    17 Adaptive filter implementation 563

    17.1 Introduction 563

    17.2 Mean square error 564

    17.3 Least mean square 565

    17.4 Implementation of an adaptive filter using the LMS algorithm 565

    17.5 Implementation using linear assembly 567

    17.6 Implementation in C language with compiler switches 572

    17.7 Laboratory experiment 572

    17.8 Conclusion 573

    References 573

    18 FFT implementation 574

    18.1 Introduction 574

    18.2 FFT algorithm 574

    18.2.1 Fourier series 574

    18.2.2 Fourier transform 575

    18.2.3 Discrete Fourier transform 575

    18.2.4 Fast Fourier transform 576

    18.2.4.1 Splitting the DFT into two DFTs 576

    18.2.4.2 Exploiting the periodicity and symmetry of the twiddle factors 577

    18.3 FFT implementation 579

    18.4 Laboratory experiment 582

    18.4.1 Part 1: Implementation of DIF FFT 582

    18.4.2 Part 2: Using ping-pong EDMA 585

    18.5 Conclusion 590

    References 590

    19 Hough transform 591

    19.1 Introduction 591

    19.2 Theory 591

    19.3 Limits of r and θ 593

    19.4 Hough transform implementation 595

    19.5 Laboratory experiment 596

    19.6 Conclusion 603

    References 603

    20 Stereo vision implementation 604

    20.1 Introduction 604

    20.2 Algorithm for performing depth calculation 605

    20.3 Cost functions 606

    20.4 Implementation 607

    20.4.1 Laboratory experiment 610

    20.4.1.1 SAD implementation 610

    20.4.1.2 NCC implementation 611

    20.4.1.3 ZNCC implementation 611

    20.5 Conclusion 613

    References 616

    Index 617

Multicore DSP

    Product form

    £92.10

    Includes FREE delivery

    RRP £96.95 – you save £4.85 (5%)

    Order before 4pm today for delivery by Mon 6 Jul 2026.

    A Hardback by Naim Dahnoun

      Trusted by thousands of customers. See 2,385+ Customer Reviews

      View other formats and editions of Multicore DSP by Naim Dahnoun

      Publisher: John Wiley & Sons Inc
      Publication Date: 02/02/2018
      ISBN13: 9781119003823, 978-1119003823
      ISBN10: 1119003822

      Description

      Book Synopsis

      The only book to offer special coverage of the fundamentals of multicore DSP for implementation on the TMS320C66xx SoC

      This unique book provides readers with an understanding of the TMS320C66xx SoC as well as its constraints. It offers critical analysis of each element, which not only broadens their knowledge of the subject, but aids them in gaining a better understanding of how these elements work so well together.

      Written by Texas Instruments' First DSP Educator Award winner, Naim Dahnoun, the book teaches readers how to use the development tools, take advantage of the maximum performance and functionality of this processor and have an understanding of the rich content which spans from architecture, development tools and programming models, such as OpenCL and OpenMP, to debugging tools. It also covers various multicore audio and image applications in detail. Additionally, this one-of-a-kind book is supplemented with:

      • A rich set of tested laborator

        Table of Contents

        Preface xviii

        Acknowledgements xxi

        Foreword xxii

        About the Companion Website xxiii

        1 Introduction to DSP 1

        1.1 Introduction 1

        1.2 Multicore processors 3

        1.2.1 Can any algorithm benefit from a multicore processor? 3

        1.2.2 How many cores do I need for my application? 5

        1.3 Key applications of high-performance multicore devices 6

        1.4 FPGAs, Multicore DSPs, GPUs and Multicore CPUs 8

        1.5 Challenges faced for programming a multicore processor 9

        1.6 Texas Instruments DSP roadmap 10

        1.7 Conclusion 11

        References 12

        2 The TMS320C66x architecture overview 14

        2.1 Overview 14

        2.2 The CPU 15

        2.2.1 Cross paths 16

        2.2.1.1 Data cross paths 17

        2.2.1.2 Address cross paths 18

        2.2.2 Register file A and file B 20

        2.2.2.1 Operands 20

        2.2.3 Functional units 21

        2.2.3.1 Condition registers 21

        2.2.3.2 .L units 22

        2.2.3.3 .M units 22

        2.2.3.4 .S units 23

        2.2.3.5 .D units 23

        2.3 Single instruction, multiple data (SIMD) instructions 24

        2.3.1 Control registers 24

        2.4 The KeyStone memory 24

        2.4.1 Using the internal memory 27

        2.4.2 Memory protection and extension 29

        2.4.3 Memory throughput 29

        2.5 Peripherals 30

        2.5.1 Navigator 32

        2.5.2 Enhanced Direct Memory Access (EDMA) Controller 32

        2.5.3 Universal Asynchronous Receiver/Transmitter (UART) 32

        2.5.4 General purpose input–output (GPIO) 32

        2.5.5 Internal timers 32

        2.6 Conclusion 33

        References 33

        3 Software development tools and the TMS320C6678 EVM 35

        3.1 Introduction 35

        3.2 Software development tools 37

        3.2.1 Compiler 38

        3.2.2 Assembler 39

        3.2.3 Linker 40

        3.2.3.1 Linker command file 40

        3.2.4 Compile, assemble and link 42

        3.2.5 Using the Real-Time Software Components (RTSC) tools 42

        3.2.5.1 Platform update using the XDCtools 42

        3.2.6 KeyStone Multicore Software Development Kit 47

        3.3 Hardware development tools 47

        3.3.1 EVM features 47

        3.4 Laboratory experiments based on the C6678 EVM: introduction to Code Composer Studio (CCS) 51

        3.4.1 Software and hardware requirements 51

        3.4.1.1 Key features 52

        3.4.1.2 Download sites 53

        3.4.2 Laboratory experiments with the CCS6 53

        3.4.2.1 Introduction to CCS 55

        3.4.2.2 Implementation of a DOTP algorithm 63

        3.4.3 Profiling using the clock 65

        3.4.4 Considerations when measuring time 67

        3.5 Loading different applications to different cores 67

        3.6 Conclusion 72

        References 72

        4 Numerical issues 74

        4.1 Introduction 74

        4.2 Fixed- and floating-point representations 75

        4.2.1 Fixed-point arithmetic 76

        4.2.1.1 Unsigned integer 76

        4.2.1.2 Signed integer 77

        4.2.1.3 Fractional numbers 77

        4.2.2 Floating-point arithmetic 78

        4.2.2.1 Special numbers for the 32-bit and 64-bit floating-point formats 81

        4.3 Dynamic range and accuracy 82

        4.4 Laboratory exercise 83

        4.5 Conclusion 85

        References 85

        5 Software optimisation 86

        5.1 Introduction 86

        5.2 Hindrance to software scalability for a multicore processor 88

        5.3 Single-core code optimisation procedure 88

        5.3.1 The C compiler options 90

        5.4 Interfacing C with intrinsics, linear assembly and assembly 91

        5.4.1 Intrinsics 91

        5.4.2 Interfacing C and assembly 92

        5.5 Assembly optimisation 97

        5.5.1 Parallel instructions 98

        5.5.2 Removing the NOPs 99

        5.5.3 Loop unrolling 99

        5.5.4 Double-Word Access 100

        5.5.5 Optimisation summary 100

        5.6 Software pipelining 101

        5.6.1 Software-pipelining procedure 105

        5.6.1.1 Writing linear assembly code 105

        5.6.1.2 Creating a dependency graph 105

        5.6.1.3 Resource allocation 108

        5.6.1.4 Scheduling table 108

        5.6.1.5 Generating assembly code 109

        5.7 Linear assembly 111

        5.7.1 Hand optimisation of the dotp function using linear assembly 112

        5.8 Avoiding memory banks 118

        5.9 Optimisation using the tools 118

        5.10 Laboratory experiments 123

        5.11 Conclusion 126

        References 126

        6 The TMS320C66x interrupts 127

        6.1 Introduction 127

        6.1.1 Chip-level interrupt controller 129

        6.2 The interrupt controller 135

        6.3 Laboratory experiment 140

        6.3.1 Experiment 1: Using the GIPIOs to trigger some functions 140

        6.3.2 Experiment 2: Using the console to trigger an interrupt 140

        6.4 Conclusion 143

        References 144

        7 Real-time operating system: TI-RTOS 145

        7.1 Introduction 146

        7.2 TI-RTOS 146

        7.3 Real-time scheduling 148

        7.3.1 Hardware interrupts (Hwis) 148

        7.3.1.1 Setting an Hwi 149

        7.3.1.2 Hwi hook functions 149

        7.3.2 Software interrupts (Swis), including clock, periodic or single-shot functions 155

        7.3.3 Tasks 155

        7.3.3.1 Task hook functions 157

        7.3.4 Idle functions 158

        7.3.5 Clock functions 158

        7.3.6 Timer functions 158

        7.3.7 Synchronisation 158

        7.3.7.1 Semaphores 159

        7.3.7.2 Semaphore_pend 159

        7.3.7.3 Semaphore_post 159

        7.3.7.4 How to configure the semaphores 159

        7.3.8 Events 159

        7.3.9 Summary 163

        7.4 Dynamic memory management 163

        7.4.1 Stack allocation 165

        7.4.2 Heap allocation 165

        7.4.3 Heap implementation 165

        7.4.3.1 HeapMin implementation 165

        7.4.3.2 HeapMem implementation 165

        7.4.3.3 HeapBuf implementation 167

        7.4.3.4 HeapMultiBuf implementation 171

        7.5 Laboratory experiments 172

        7.5.1 Lab 1: Manual setup of the clock (part 1) 172

        7.5.2 Lab 2: Manual setup of the clock (part 2) 172

        7.5.3 Lab 3: Using Hwis, Swis, tasks and clocks 174

        7.5.4 Lab 4: Using events 187

        7.5.5 Lab 5: Using the heaps 189

        7.6 Conclusion 190

        References 191

        References (further reading) 191

        8 Enhanced Direct Memory Access (EDMA3) controller 192

        8.1 Introduction 192

        8.2 Type of DMAs available 193

        8.3 EDMA controllers architecture 194

        8.3.1 The EDMA3 Channel Controller (EDMA3CC) 194

        8.3.2 The EDMA3 transfer controller (EDMA3TC) 201

        8.3.3 EDMA prioritisation 201

        8.3.3.1 Trigger source priority 202

        8.3.3.2 Channel priority 203

        8.3.3.3 Dequeue priority 203

        8.3.3.4 System (transfer controller) priority 203

        8.4 Parameter RAM (PaRAM) 203

        8.4.1 Channel options parameter (OPT) 203

        8.5 Transfer synchronisation dimensions 203

        8.5.1 A – Synchronisation 204

        8.5.2 AB – Synchronisation 204

        8.6 Simple EDMA transfer 204

        8.7 Chaining EDMA transfers 208

        8.8 Linked EDMAs 208

        8.9 Laboratory experiments 210

        8.9.1 Laboratory 1: Simple EDMA transfer 211

        8.9.2 Laboratory 2: EDMA chaining transfer 211

        8.9.3 Laboratory 3: EDMA link transfer 213

        8.10 Conclusion 213

        References 213

        9 Inter-Processor Communication (IPC) 214

        9.1 Introduction 215

        9.2 Texas Instruments IPC 217

        9.3 Notify module 219

        9.3.1 Laboratory experiment 222

        9.4 MessageQ 222

        9.4.1 MessageQ protocol 224

        9.4.2 Message priority 229

        9.4.3 Thread synchronisation 229

        9.5 ListMP module 233

        9.6 GateMP module 234

        9.6.1 Initialising a GateMP parameter structure 234

        9.6.1.1 Types of gate protection 235

        9.6.2 Creating a GateMP instance 236

        9.6.3 Entering a GateMP 236

        9.6.4 Leaving a gate 236

        9.6.5 The list of functions that can be used by GateMP 237

        9.7 Multi-processor Memory Allocation: HeapBufMP, HeapMemMP and HeapMultiBufMP 237

        9.7.1 HeapBuf_Params 238

        9.7.2 HeapMem_Params 239

        9.7.3 HeapMultiBuf_Params 239

        9.7.4 Configuration example for HeapMultiBuf 239

        9.8 Transport mechanisms for the IPC 241

        9.9 Laboratory experiments with KeyStone I 241

        9.9.1 Laboratory 1: Using MessageQ with multiple cores 241

        9.9.1.1 Overview 242

        9.9.2 Laboratory 2: Using ListMP, ShareRegion and GateMP 243

        9.10 Laboratory experiments with KeyStone II 249

        9.10.1 Laboratory experiment 1: Transferring a block of data 249

        9.10.1.1 Set the connection between the host (PC) and the KeyStone 249

        9.10.1.2 Explore the ARM code 250

        9.10.1.3 Explore the DSP code 259

        9.10.1.4 Compile and run the program 263

        9.10.2 Laboratory experiment 2: Transferring a pointer 267

        9.10.2.1 Explore the ARM code 267

        9.10.2.2 Explore the DSP code 271

        9.10.2.3 Compile and run the program 278

        9.11 Conclusion 278

        References 278

        10 Single and multicore debugging 280

        10.1 Introduction 281

        10.2 Software and hardware debugging 282

        10.3 Debug architecture 282

        10.3.1 Trace 282

        10.3.1.1 Standard trace 282

        10.3.1.2 Event trace 283

        10.3.1.3 System trace 285

        10.4 Advanced Event Triggering 286

        10.4.1 Advanced Event Triggering logic 289

        10.4.2 Unified Breakpoint Manager 294

        10.5 Unified Instrumentation Architecture 295

        10.5.1 Host-side tooling 295

        10.5.2 Target-side tooling 295

        10.5.2.1 Software instrumentation APIs 297

        10.5.2.2 Predefined software events and metadata 297

        10.5.2.3 Event loggers 297

        10.5.2.4 Transports 297

        10.5.2.5 SYS/BIOS event capture and transport 297

        10.5.2.6 Multicore support 297

        10.6 Debugging with the System Analyzer tools 298

        10.6.1 Target-side coding with UIA APIs and the XDCtools 299

        10.6.2 Logging events with Log_write() functions 300

        10.6.3 Advance debugging using the diagnostic feature 301

        10.6.4 LogSnapshot APIs for logging state information 302

        10.7 Instrumentation with TI-RTOS and CCS 302

        10.7.1 Using RTOS Object Viewer 302

        10.7.2 Using the RTOS Analyzer and the System Analyzer 303

        10.7.2.1 RTOS Analyzer 303

        10.7.2.2 System Analyzer 303

        10.8 Laboratory sessions 305

        10.8.1 Laboratory experiment 1: Using the RTOS ROV 305

        10.8.2 Laboratory experiment 2: Using the RTOS Analyzer 305

        10.8.3 Laboratory experiment 3: Using the System Analyzer 312

        10.8.4 Laboratory experiment 4: Using diagnosis features 314

        10.8.5 Laboratory experiment 5: Using a diagnostic feature with filtering 317

        10.9 Conclusion 321

        References 322

        Further reading 323

        11 Bootloader for KeyStone I and KeyStone II 324

        11.1 Introduction 324

        11.2 How to start the boot process 325

        11.3 The boot process 325

        11.4 ROM Bootloader (RBL) 328

        11.4.1 The boot configuration format 336

        11.4.1.1 Creating the boot parameter table 336

        11.4.1.2 Creating the boot table 338

        11.4.1.3 The boot configuration table 338

        11.5 Boot process 340

        11.5.1 Initialisation stage for the KeyStone I 340

        11.5.2 Second-level bootloader 341

        11.5.2.1 Intermediate bootloader 341

        11.5.2.2 How to use the IBL 342

        11.6 Laboratory experiment 1 345

        11.6.1 Initialisation stage for the KeyStone II 350

        11.6.1.1 Bootloader initialisation after power-on reset 350

        11.6.1.2 Bootloader initialisation process after hard or soft reset 350

        11.6.2 Second bootloader for the KeyStone II 350

        11.6.2.1 U-Boot 351

        11.7 Laboratory experiment 2 352

        11.7.1 Printing the U-Boot environment 360

        11.7.2 Using the help for U-Boot 362

        11.8 TFTP boot with a host-mounted Network File System (NFS) server – NFS booting 363

        11.8.1 Laboratory experiment 3 364

        11.9 Conclusion 372

        References 372

        12 Introduction to OpenMP 374

        12.1 Introduction to OpenMP 375

        12.2 Directive formats 376

        12.3 Forking region 377

        12.3.1 omp parallel – parallel region construct 377

        12.3.1.1 Clause descriptions 378

        12.4 Work-sharing constructs 382

        12.4.1 omp for 382

        12.4.1.1 OpenMP loop scheduling 383

        12.4.2 omp sections 385

        12.4.3 omp single 386

        12.4.4 omp master 386

        12.4.5 omp task 387

        12.5 Environment variables and library functions 390

        12.6 Synchronisation constructs 392

        12.6.1 atomic 393

        12.6.1.1 Clauses 393

        12.6.2 barrier 395

        12.6.3 critical 396

        12.7 OpenMP accelerator model 397

        12.7.1 Supported OpenMP device constructs 397

        12.7.1.1 #pragma omp target 397

        12.7.1.2 #pragma omp target data 399

        12.7.1.3 #pragma omp target update 400

        12.7.1.4 #pragma omp declare target 401

        12.8 Laboratory experiments 402

        12.8.1 Laboratory experiment 1 402

        12.8.2 Laboratory experiment 2 402

        12.8.3 Laboratory experiment 3 404

        12.8.4 Laboratory experiment 4 405

        12.8.5 Laboratory experiment 5 405

        12.9 Conclusion 417

        References 419

        13 Introduction to OpenCL for the KeyStone II 420

        13.1 Introduction 421

        13.2 Operation of OpenCL 421

        13.3 Command queue 424

        13.3.1 Creating a command queue 427

        13.3.1.1 Command-queue properties 429

        13.3.2 Enqueueing a kernel 430

        13.4 Kernel declaration 431

        13.5 How do the kernels access data? 431

        13.6 OpenCL memory model for the KeyStone 432

        13.6.1 Creating a buffer 433

        13.6.1.1 Cl_mem_flags 434

        13.7 Synchronisation 435

        13.7.1 Event with a callback function 436

        13.7.2 User event 439

        13.7.3 Waiting for one command or all commands to finish 439

        13.7.4 wait_group_events 440

        13.7.5 Barrier 440

        13.8 Basic debugging profiling 440

        13.9 OpenMP dispatch from OpenCL 443

        13.9.1 OpenMP for the kernel code 443

        13.9.2 OpenMP for the ARM code 443

        13.10 Building the OpenCL project 444

        13.11 Laboratory experiments 445

        13.11.1 Laboratory experiment 1: Hello World 446

        13.11.2 Laboratory experiment 2: dotp functions 454

        13.11.2.1 Explore the main.cpp function 454

        13.11.2.2 Explore the kernel dotp.cl 459

        13.11.2.3 Run the dotp program 460

        13.11.3 Laboratory experiment 3: USE_HOST_PTR 460

        13.11.4 Laboratory experiment 4: ALLOC_HOST_PTR 463

        13.11.5 Laboratory experiment 5: COPY_HOST_PTR 465

        13.11.6 Laboratory experiment 6: Synchronisation 467

        13.11.7 Laboratory experiment 7: Local buffer 473

        13.11.8 Laboratory experiment 8: Barrier 477

        13.11.9 Laboratory experiment 9: Profiling 479

        13.11.10 Laboratory experiment 10: OpenMP in kernel 484

        13.11.11 Laboratory experiment 11: OpenMP in ARM 487

        13.12 Conclusion 489

        References 490

        14 Multicore Navigator 491

        14.1 Introduction 491

        14.2 Navigator architecture 492

        14.2.1 The PKDMA 494

        14.2.1.1 PKDMA transmit side 495

        14.2.1.2 PKDMA receive side 495

        14.2.1.3 Infrastructure PKDMA 497

        14.2.2 Descriptors 497

        14.2.2.1 Host packet descriptors 498

        14.2.2.2 Monolithic packet descriptor 498

        14.2.2.3 Setting up the memory regions for the descriptors 498

        14.2.3 Queue Manager Subsystem 500

        14.2.4 Queue Manager 503

        14.2.4.1 Queue peek registers 503

        14.2.4.2 Link RAM 504

        14.2.5 Accumulator packet data structure processors 504

        14.2.5.1 Accumulation 506

        14.2.5.2 Quality of service 506

        14.2.5.3 Event management (resource sharing and job load balancing) 506

        14.2.6 Interrupt distributor module 506

        14.3 Complete functionality of the Navigator 506

        14.4 Laboratory experiment 511

        14.5 Conclusion 513

        References 514

        15 FIR filter implementation 515

        15.1 Introduction 515

        15.2 Properties of an FIR filter 516

        15.2.1 Filter coefficients 516

        15.2.2 Frequency response of an FIR filter 516

        15.2.3 Phase linearity of an FIR filter 517

        15.3 Design procedure 518

        15.3.1 Specifications 518

        15.3.2 Coefficients calculation 519

        15.3.2.1 Window method 519

        15.3.3 Realisation structure 522

        15.3.3.1 Direct structure 525

        15.3.3.2 Linear phase structures 525

        15.3.3.3 Cascade structures 527

        15.4 Laboratory experiments 528

        15.4.1 Filter implementation 529

        15.4.2 Synchronisation 530

        15.4.3 Building and running the DSP project 532

        15.4.4 Building and running the PC project 534

        15.5 Conclusion 540

        References 540

        16 IIR filter implementation 542

        16.1 Introduction 542

        16.2 Design procedure 543

        16.3 Coefficients calculation 543

        16.3.1 Pole–zero placement approach 543

        16.3.2 Analogue-to-digital filter design 543

        16.3.3 Bilinear transform (BZT) method 544

        16.3.3.1 Practical example of the bilinear transform method 547

        16.3.3.2 Coefficients calculation 547

        16.3.3.3 Realisation structures 548

        16.3.4 Impulse invariant method 552

        16.3.4.1 Practical example of the impulse invariant method 553

        16.4 IIR filter implementation 556

        16.5 Laboratory experiment 561

        16.6 Conclusion 561

        Reference 562

        17 Adaptive filter implementation 563

        17.1 Introduction 563

        17.2 Mean square error 564

        17.3 Least mean square 565

        17.4 Implementation of an adaptive filter using the LMS algorithm 565

        17.5 Implementation using linear assembly 567

        17.6 Implementation in C language with compiler switches 572

        17.7 Laboratory experiment 572

        17.8 Conclusion 573

        References 573

        18 FFT implementation 574

        18.1 Introduction 574

        18.2 FFT algorithm 574

        18.2.1 Fourier series 574

        18.2.2 Fourier transform 575

        18.2.3 Discrete Fourier transform 575

        18.2.4 Fast Fourier transform 576

        18.2.4.1 Splitting the DFT into two DFTs 576

        18.2.4.2 Exploiting the periodicity and symmetry of the twiddle factors 577

        18.3 FFT implementation 579

        18.4 Laboratory experiment 582

        18.4.1 Part 1: Implementation of DIF FFT 582

        18.4.2 Part 2: Using ping-pong EDMA 585

        18.5 Conclusion 590

        References 590

        19 Hough transform 591

        19.1 Introduction 591

        19.2 Theory 591

        19.3 Limits of r and θ 593

        19.4 Hough transform implementation 595

        19.5 Laboratory experiment 596

        19.6 Conclusion 603

        References 603

        20 Stereo vision implementation 604

        20.1 Introduction 604

        20.2 Algorithm for performing depth calculation 605

        20.3 Cost functions 606

        20.4 Implementation 607

        20.4.1 Laboratory experiment 610

        20.4.1.1 SAD implementation 610

        20.4.1.2 NCC implementation 611

        20.4.1.3 ZNCC implementation 611

        20.5 Conclusion 613

        References 616

        Index 617

      Recently viewed products

      © 2026 Book Curl

        • American Express
        • Apple Pay
        • Diners Club
        • Discover
        • Google Pay
        • Maestro
        • Mastercard
        • PayPal
        • Shop Pay
        • Union Pay
        • Visa

        Login

        Forgot your password?

        Don't have an account yet?
        Create account