FPGA implementation of a near-ML sphere detector for 802...

transcript

FPGA implementation of a near-ML sphere detector for 802.16e broadband wireless systems

chris dick: XilinxMilos Trajkovic, Slobodan Denic, Dragan Vuletic: Signum ConceptsRaghu Rao: Xilinxfred harris: Signum Concepts & San Diego State UniversityKiarash Amiri: Rice University

Agenda

Spatial multiplexing MIMO Quasi-ML SM decoders and sphere detection Useful approximations for efficient hardware implementation Guided tour of sphere detector implementation for 802.16e Implementation results

802.16e Sphere Detector Implementation

Specifications– 5MHz WiMAX– 360 data sub-carriers– antenna ordering to be updated for every sub-carrier for every OFDM symbol

• extremely rapid update; could probably be relaxed

– 2,3 or four antennas programmable dynamically at run-time– QPSK, 16-QAM and 64-QAM configurable on a per user basis

... ...

MIMO channel

... ...

In OutDACRF

ADCRF... ...

11 12 13

21 22 23

31 32 33

h h hh h hh h h

TX ant num

FPGA Sphere Detector Effective FPGA implementation is intersection of:

Algorithm optimizations• K-best, depth-first search, sort-free,…

Architecture• clock rates• hardware folding

Micro-architecture optimizations that are possible with the FPGA• bit-width optimizations• math optimizations: norms to simplify hardware footprint for

sphere detector for example

1 2vs vs

Sphere Detection – QRD Pre-processing

Maximum Likelihood detection

Key element of complexity reduction is suitable factoring of channel matrix H

MR x MT matrix Q has orthonormal columns and R is upper triangular Distance metric can be written as

2 ZFˆ ˆ( ) with Hd s c y y Rs y Q Rs 'y R s 2||||

2ˆ arg minMTs

2ˆ y R s

Example: Solving for Antenna 2, Real Component of Symbol

Solve for symbol transmitted on antenna 1 by performing hypothesis test– recall that for QPSK the coordinates for the constellation points are = {±3,±1}

Form all of the possible required products Rs and differences as shown

Now make a decision for the real component of the symbol sent on antenna 1

1,1 1,2 1,31

2 2,32

0 00 ˆ0

MR R RyR Ry

1 3 1 1 3ˆ min , , ,TMs d s d s d s d s

1,1 1,2 1,31

2 2,32

0 00 ˆ0

MR R RyR Ry

1,1 1,2 1,31

2 2,32

0 00 ˆ0

MR R RyR Ry

1,1 1,2 1,31

2 2,32

0 00 ˆ0

MR R RyR Ry

Process of Sphere Decoding

The decoding process can be mapped to finding a shortest path (with minimum Euclidean distance) in a tree topology

decodingsequence

Computational complexity for one solution Adds: M(M+1)/2 + 2M − 1 Mults: M(M+1)/2 complex mults 2M square operators Mem for H: M2 complex values

E.g. 16×16 systems Nadd = 167 Nmult = 136 complex mults 32 square operators Mem for H: 256 complex values

1 1,1 1 1,2 2 1,

2 2 2,2 2 2,

21,1 1

T T TT

M M M M

y R s R s R s

y R s R s

ant-1ant-2

ant-MT

common term for each row

Mapping Distance Computation to FPGA

The actual system we are considering is 4x4 and the matrix equation looks like

0,0 0,1 0,2 0,3 0,4 0,5 0,6 0,70

1,1 1,2 1,3 1,4 1,5 1,6 1,71

2,2 2,3 2,4 2,5 2,6 2,72

3,3 3,4 3,5 3,6 3,73

4,4 4,5 4,6 4,74

5,5 5,6 5,75

6,6 6,76

R R R R R R R RyR R R R R R Ry

R R R R R RyR R R R Ry

R R R RyR R Ry

R RyRy

ssssssss

The element wise product and subtraction maps well to the DSP48 element in the FPGA

Leverages the dedicated routing between cascaded DSP48’s to minimize the critical path and hence maximize the FPGA clock frequency

DSP48 tile contains a multiplier for computing the products and add/sub for accumulating the partial-products

Symbol

Dedicated high-speed interconnect in Xilinx FPGA

SymbolPath MetricUpdate

Sphere detector essentially a streaming dataflow computation Parallel processing pipeline for computing the incremental distance metric

Sphere Detection Tree Example: MT =4 antennas, K=3, 16-QAM

K survivors at each level requires sort

→ high-latency process

25 5,7 7 5,6 6 5,5 5y R s R s R s i=6

26 6,7 7 6,6 6y R s R s i=7

27 7,7 7y R s

MIN value: Detected Vector

4ˆRe s 4ˆIm s

2ˆIm s

2ˆRe s

FlexSphere Detection Tree4x4 64-QAM

Two levels in tree correspond to one antenna First 2 levels fully expanded Expansion of the minimum node rather sorting whole level

– approximate sort avoids requirement for high-latency min-search at each level

Micro-Architecture Optimizations to Reduce Cost: Simplified Norms

Motivation for using simplified norm is to reduce complexity of implementation Minimize FPGA resource footprint Mechanism that permits tradeoff

between FPGA architectural elements, e.g. DSP48’s for LUTs

Let partial sphere constraint is

X X e r

approximate the norm with a different norm

X f X e

In the complex-valued case, corresponding approximations for are

norm: or

norm: max ,

L-1, L-2 Norm Comparison

Negligible impact using simpler L-1 norm versus L-2 norm

K-Best sphere detector is quasi MLD solution Already some BER degradation introduced with K-Best

Eliminates MPYs in implementation which may be beneficial

0 5 10 15 20 25 3010

EbNo [dB]

16 QAM, K-best, 2X2

K-best (K=4), L-2 normK-best (K=4), L-1 normK-best (K=6), L-2 normK-best (K=6), L-1 normML

Floating-point simulation

802.16e Sphere DetectorAntenna Ordering

Figure below shows the antenna ordering pipeline

1† * *

1 1 ,[ ]

arg for 1 arg for 1

max min

j j j j j T

j j j jk kk k

k j k j

G H H H H

H H H H

2, 2G i

2,1G i

2,3G i

2, 4G i

norm calculations

802.16e Sphere DetectorTop-Level Block Diagram

Key to achieving good BER performance is around how the antenna ordering is performed– the ‘V-BLAST channel reordering’ component above

Assume we have the channel estimate

H sortedH

802.16e Sphere DetectorAntenna Ordering Processing Pipeline

Linear processing pipeline of MT-1 stages employed Each stage: QRD matrix inversion, norm, updated H Buffer memories store intermediate results and support the

TDM pipeline

Antenna ordering: Top Each stage of the pipeline

realized using a QRD-style systolic array

G matrix calc

Norm search

Update Hsorted

G matrix calc

Norm search

Update Hsorted

G matrix calc

Norm search

Update Hsorted

360*32x18b

5*16*2x18b

4X4 matrix calculations

Matrix Inversion using complex QRDPipeline Processing

QRD using Givens rotations

Angle estimation and vector rotation performed using complex multiplication and approximation

High-speed pipelined architecture using Xilinx DSP tiles (DSP48)

Heavy pipelining and time division multiplexing of hardware employed to meet latency/throughput requirements of 802.16e

Channel Matrix H 4x4 Configuration

3,0h 3,1h

3,1 3,12multication3,0performed ininner cell

2 2I QCmplx.Conj.

To Inner Cells

1( ) ( ) ( );

( ) and ( ) stored in FPGA BRAMaddition and ( ) realized using DSP48

f x x f x x f x f xx

f x f xx f x

3,1h 3,1h

Cmplx.Conj.

SaveCurrent

ReadPrevious

To Inner Cells

First Set of Rotations

Perform rotations to successively null matrix entries

Approximating 1/sqrt(x)f(x+x) = f(x) + x f’(x); f(x) = 1/sqrt(x)

0 0.5 1 1.5 2 2.5 3x 105

1.5Floating-point Matlab and Floating-point Approximation

Floating point MatlabFloating point approximation

0 0.5 1 1.5 2 2.5 3x 105

1.5Floating-Point Matlab and Fixed-Point Approximation

Floating point MatlabFixed point approximation

0.5 1 1.5 2-4

8x 10-6 Difference Function - Floating-Point

0.5 1 1.5 2-1.5

1.5x 10-5 (Apprx Fixed-Point) - Fixed-point

FlexSphere BERUsing Novel Interference Cancellation

Key to performance is on determining suitable order in which to process the antennas

Innovative element of this design is on the use of interference cancellation in channel matrix pre-processing and novel real-valued ordering of the sphere detector tree

BER plot for fixed-point design generated from simulation of System Generator model

Floating-point matlab model overlay also shown

0 5 10 15 20 2510

EbNo [dB]

64 QAM, 4X4

Fixed-Point FPGA Sphere Det./preprocessingFloating-point Matlab Sphere Det./preprocessingMaximum Likelihood

802.16e Sphere DetectorTree Search

Once the tree ordering is determined a detection tree is assembled using real-valued decomposition and approximate sorting strategy to minimize detection latency

Sphere Detector Processing Pipeline

7,:R (7)y 6,:R (6)y

8i 7i 6i 1i

1,:R (1)y

( )iiT s

2( ) ( 1) ( )1

i i ii i iT T e

ii i ij j

b y R s

1 2 8 57 58 64

Detected Vector

Min-Search

Development / Debugging Environment (Host PC)Deployment Environment

802.16e Sphere DetectorDevelopment/Deployment

LocalLink Interface

Gigabit Ethernet Switch

SysGen HW Co-sim APISysGen HW Co-sim APIMicroprocessorMicroprocessor DSP

ProcessorDSP

Processor

StandaloneC/C++ Program

MATLAB/SimulinkMATLAB/Simulink

Ethernet Co-simulation Protocol

Virtex-5 FPGA

Virtex-5 EMACLocal-Link Wrapper

ClockGenerator

MemoryMap

EthernetCo-simulation

Processor

EthernetCo-simulation

Processor

PHYPHY

I/OPortsI/O

SharedRegistersShared

Registers

SharedRAMs

SysGen DSP Design

(HW CosimDesign)

Antenna layerprocessing, QRD, Sphere detector

SysGen DSP Design

(HW CosimDesign)

Antenna layerprocessing, QRD, Sphere detector

RX FIFO

TX FIFO

RX FIFO

TX FIFO

UserFunction 2

Virtex-6 LX240T: 241,152 LCs, 768 MPYs USB, ethernet Hardware cosim support using Xilinx DSP

802.16e Sphere DetectorBERT using HW Cosim

System Generator HW Cosim module from previous slide

PN seq generator

Channel Preproc.y

Flex-Sphere Detector

6b QAM mapper

Gaussian noise gen

H~PEDmin

sout 64QAM de-mapper

~ Self-synchronizing

BER tester

Eb/Nodata register

EbNo ctrl

Calc scale factor

BER_ready

BER_out

MIMO-OFDM Spatial Multiplexing Detector

Modulation

802.16e Sphere DetectorBERT using HW Cosim: Demo UI

0 5 10 15 20 2510-5

802.16e Sphere DetectorFPGA Resource Utilization

V-BLAST channel ordering is key Each sub-carrier for each OFDM symbol

– could probably be relaxed to reduce FPGA area

Main compute complexity is in the antenna ordering 802.16e 5 MHz bandwidth, 360 data sub-carriers, 4x4 antenna

configuration, 64-QAM Real-time operation at 102.9 microsecond frame rate Compute performance

– ~53E9 MPYs/second, ~21E6 matrix inversions/second, 225E6 path metric computations/second

Function Slices LUT/FFT Pairs Block Memory (18k)

DSP Slices

Channel Pre-processor 9,999 20,339/29,954 105 159 RVD QRD 1,715 4,418/5,556 27 30 Sphere Detector 2,445 3,113/6,525 12 48 Total 14,159 27,870/42,035 144 237

Conclusion

Key to performance is on determining suitable order in which to process the antennas

Innovative element of this design is on the use of interference cancellation in channel matrix pre-processing and novel real-valued ordering of the sphere detector tree

Might not want to lock this function down in ASIC/ASSP– desire to have flexibility to tailor ML-decoder to air-interface requirements– upgrade with new algorithms

Algorithmic, architecture and micro-architecture optimizations required for good implementation

Architectural transparency provided by System Generator design flow allows for production of FPGA optimized implementation

FPGA implementation of a near-ML sphere detector for 802...

Documents