Post on 21-Aug-2020
transcript
FPGA implementation of a near-ML sphere detector for 802.16e broadband wireless systems
chris dick: XilinxMilos Trajkovic, Slobodan Denic, Dragan Vuletic: Signum ConceptsRaghu Rao: Xilinxfred harris: Signum Concepts & San Diego State UniversityKiarash Amiri: Rice University
Page 2
Agenda
Spatial multiplexing MIMO Quasi-ML SM decoders and sphere detection Useful approximations for efficient hardware implementation Guided tour of sphere detector implementation for 802.16e Implementation results
Page 3
802.16e Sphere Detector Implementation
Specifications– 5MHz WiMAX– 360 data sub-carriers– antenna ordering to be updated for every sub-carrier for every OFDM symbol
• extremely rapid update; could probably be relaxed
– 2,3 or four antennas programmable dynamically at run-time– QPSK, 16-QAM and 64-QAM configurable on a per user basis
Mod
ulat
ion
and
Map
ping
TX
... ...
Sig
nal P
roce
ssin
g
RX
MIMO channel
Ser
ial t
o P
aral
lel
Par
alle
l to
Ser
ial
... ...
In OutDACRF
ADCRF... ...
11 12 13
21 22 23
31 32 33
h h hh h hh h h
H
TX ant num
RX a
nt n
um
Page 4
FPGA Sphere Detector Effective FPGA implementation is intersection of:
Algorithm optimizations• K-best, depth-first search, sort-free,…
Architecture• clock rates• hardware folding
Micro-architecture optimizations that are possible with the FPGA• bit-width optimizations• math optimizations: norms to simplify hardware footprint for
sphere detector for example
1 2vs vs
Page 5
Sphere Detection – QRD Pre-processing
Maximum Likelihood detection
Key element of complexity reduction is suitable factoring of channel matrix H
MR x MT matrix Q has orthonormal columns and R is upper triangular Distance metric can be written as
H QR
2 ZFˆ ˆ( ) with Hd s c y y Rs y Q Rs 'y R s 2||||
2ˆ arg minMTs
s
y Hs
2ˆ y R s
Page 6
Example: Solving for Antenna 2, Real Component of Symbol
Solve for symbol transmitted on antenna 1 by performing hypothesis test– recall that for QPSK the coordinates for the constellation points are = {±3,±1}
Form all of the possible required products Rs and differences as shown
Now make a decision for the real component of the symbol sent on antenna 1
1,1 1,2 1,31
2,
0,
2
3
2
0,0
3
0,10
2 2,32
3,
0
3
1
3
,2
2
3
0( )
0 00 ˆ0
ˆ
ˆ0
T
T
T
M
M
MR R RyR Ry
Rys
R
sRy
RR
d
s
s
1 3 1 1 3ˆ min , , ,TMs d s d s d s d s
1,1 1,2 1,31
2,
0,
2
3
2
0,0
1
0,10
2 2,32
3,
0
3
1
3
,2
2
1
0( )
0 00 ˆ0
ˆ
ˆ0
T
T
T
M
M
MR R RyR Ry
Rys
R
sRy
RR
d
s
s
1,1 1,2 1,31
2,
0,
2
3
2
0,0
1
0,10
2 2,32
3,
0
3
1
3
,2
2
1
0( )
0 00 ˆ0
ˆ
ˆ0
T
T
T
M
M
MR R RyR Ry
Rys
R
sRy
RR
d
s
s
1,1 1,2 1,31
2,
0,
2
3
2
0,0
3
0,10
2 2,32
3,
0
3
1
3
,2
2
3
0( )
0 00 ˆ0
ˆ
ˆ0
T
T
T
M
M
MR R RyR Ry
Rys
R
sRy
RR
d
s
s
Page 7
Process of Sphere Decoding
The decoding process can be mapped to finding a shortest path (with minimum Euclidean distance) in a tree topology
decodingsequence
Computational complexity for one solution Adds: M(M+1)/2 + 2M − 1 Mults: M(M+1)/2 complex mults 2M square operators Mem for H: M2 complex values
E.g. 16×16 systems Nadd = 167 Nmult = 136 complex mults 32 square operators Mem for H: 256 complex values
2
1 1,1 1 1,2 2 1,
2 2 2,2 2 2,
,
21,1 1
2,2 2
1
2
,
ˆ
ˆˆ
ˆ
T
T
T T
T T TT
T
M M
M M
M
M M M M
M M M
y R s R s R s
y R s R s
y R s
R sbb
b
R s
R s
y Rs
ant-1ant-2
ant-MT
common term for each row
Page 8
Mapping Distance Computation to FPGA
The actual system we are considering is 4x4 and the matrix equation looks like
0,0 0,1 0,2 0,3 0,4 0,5 0,6 0,70
1,1 1,2 1,3 1,4 1,5 1,6 1,71
2,2 2,3 2,4 2,5 2,6 2,72
3,3 3,4 3,5 3,6 3,73
4,4 4,5 4,6 4,74
5,5 5,6 5,75
6,6 6,76
7,77
R R R R R R R RyR R R R R R Ry
R R R R R RyR R R R Ry
R R R RyR R Ry
R RyRy
2
0
0
1
1
2
2
3
3
ssssssss
The element wise product and subtraction maps well to the DSP48 element in the FPGA
Leverages the dedicated routing between cascaded DSP48’s to minimize the critical path and hence maximize the FPGA clock frequency
DSP48 tile contains a multiplier for computing the products and add/sub for accumulating the partial-products
C
A
B
PCOUT
DSP48
PCIN
A
B
PCOUT
DSP48
iy
Symbol
( )R
Symbol
( )R
Dedicated high-speed interconnect in Xilinx FPGA
PCIN
A
B
PCOUT
DSP48
( )R
SymbolPath MetricUpdate
Sphere detector essentially a streaming dataflow computation Parallel processing pipeline for computing the incremental distance metric
Page 9
i=5
i=4
i=3
i=2
i=1
Sphere Detection Tree Example: MT =4 antennas, K=3, 16-QAM
K survivors at each level requires sort
→ high-latency process
25 5,7 7 5,6 6 5,5 5y R s R s R s i=6
26 6,7 7 6,6 6y R s R s i=7
27 7,7 7y R s
i=8
MIN value: Detected Vector
4ˆRe s 4ˆIm s
3Re s
3Im s
2ˆIm s
2ˆRe s
1Im s
1Re s
Page 10
FlexSphere Detection Tree4x4 64-QAM
Two levels in tree correspond to one antenna First 2 levels fully expanded Expansion of the minimum node rather sorting whole level
– approximate sort avoids requirement for high-latency min-search at each level
Page 11
Micro-Architecture Optimizations to Reduce Cost: Simplified Norms
Motivation for using simplified norm is to reduce complexity of implementation Minimize FPGA resource footprint Mechanism that permits tradeoff
between FPGA architectural elements, e.g. DSP48’s for LUTs
2
221
Let partial sphere constraint is
i i
i i i
T X
X X e r
2
1
1
1
approximate the norm with a different norm
or
max ,
i i i
i i i
i i i
X f X e
X X e
X X e
1
In the complex-valued case, corresponding approximations for are
norm: or
norm: max ,
i
i i i
i i i
e
e e e
e e e
Page 12
L-1, L-2 Norm Comparison
Negligible impact using simpler L-1 norm versus L-2 norm
K-Best sphere detector is quasi MLD solution Already some BER degradation introduced with K-Best
Eliminates MPYs in implementation which may be beneficial
0 5 10 15 20 25 3010
-5
10-4
10-3
10-2
10-1
100
EbNo [dB]
BE
R
16 QAM, K-best, 2X2
K-best (K=4), L-2 normK-best (K=4), L-1 normK-best (K=6), L-2 normK-best (K=6), L-1 normML
Floating-point simulation
Page 13
802.16e Sphere DetectorAntenna Ordering
Figure below shows the antenna ordering pipeline
1† * *
2 2
1 1 ,[ ]
, 1,
arg for 1 arg for 1
,
max min
j
j j j j j T
j j j jk kk k
j j k
j M
k j k j
G H H H H
G G
H H H H
2, 2G i
2,1G i
2,3G i
2, 4G i
norm calculations
Page 14
802.16e Sphere DetectorTop-Level Block Diagram
Key to achieving good BER performance is around how the antenna ordering is performed– the ‘V-BLAST channel reordering’ component above
Assume we have the channel estimate
H sortedH
yy
H R
y
Page 15
802.16e Sphere DetectorAntenna Ordering Processing Pipeline
Linear processing pipeline of MT-1 stages employed Each stage: QRD matrix inversion, norm, updated H Buffer memories store intermediate results and support the
TDM pipeline
Antenna ordering: Top Each stage of the pipeline
realized using a QRD-style systolic array
G matrix calc
Norm search
Update Hsorted
G matrix calc
Norm search
Update Hsorted
G matrix calc
Norm search
Update Hsorted
360*32x18b
360*32x18b
5*16*2x18b
5*16*2x18b
5*16*2x18b
5*16*2x18b
5*16*2x18b
5*16*2x18b
4X4 matrix calculations
3X3 matrix calculations
2X2 matrix calculations
Page 16
Matrix Inversion using complex QRDPipeline Processing
QRD using Givens rotations
Angle estimation and vector rotation performed using complex multiplication and approximation
High-speed pipelined architecture using Xilinx DSP tiles (DSP48)
Heavy pipelining and time division multiplexing of hardware employed to meet latency/throughput requirements of 802.16e
Channel Matrix H 4x4 Configuration
3,0h 3,1h
*3,0
3,1 3,12multication3,0performed ininner cell
hh h
h
I jQ
2 2I QCmplx.Conj.
22 2
I jQ
I Q
To Inner Cells
1( ) ( ) ( );
( ) and ( ) stored in FPGA BRAMaddition and ( ) realized using DSP48
f x x f x x f x f xx
f x f xx f x
3,1h 3,1h
1/ x
1/ x
2
2
Cmplx.Conj.
SaveCurrent
ReadPrevious
To Inner Cells
First Set of Rotations
Perform rotations to successively null matrix entries
22 2
I jQ
I Q
Page 17
Approximating 1/sqrt(x)f(x+x) = f(x) + x f’(x); f(x) = 1/sqrt(x)
0 0.5 1 1.5 2 2.5 3x 105
0.7
0.8
0.9
1
1.1
1.2
1.3
1.4
1.5Floating-point Matlab and Floating-point Approximation
x
1/sq
rt(x)
Floating point MatlabFloating point approximation
0 0.5 1 1.5 2 2.5 3x 105
0.7
0.8
0.9
1
1.1
1.2
1.3
1.4
1.5Floating-Point Matlab and Fixed-Point Approximation
x
1/sq
rt(x)
Floating point MatlabFixed point approximation
0.5 1 1.5 2-4
-2
0
2
4
6
8x 10-6 Difference Function - Floating-Point
x
App
roxi
mat
ion(
1/sq
rt(x)
) - 1
/sqr
t(x)
0.5 1 1.5 2-1.5
-1
-0.5
0
0.5
1
1.5x 10-5 (Apprx Fixed-Point) - Fixed-point
x
App
roxi
mat
ion(
1/sq
rt(x)
) - 1
/sqr
t(x)
Page 18
FlexSphere BERUsing Novel Interference Cancellation
Key to performance is on determining suitable order in which to process the antennas
Innovative element of this design is on the use of interference cancellation in channel matrix pre-processing and novel real-valued ordering of the sphere detector tree
BER plot for fixed-point design generated from simulation of System Generator model
Floating-point matlab model overlay also shown
0 5 10 15 20 2510
-5
10-4
10-3
10-2
10-1
100
EbNo [dB]
BE
R
64 QAM, 4X4
Fixed-Point FPGA Sphere Det./preprocessingFloating-point Matlab Sphere Det./preprocessingMaximum Likelihood
Page 19
802.16e Sphere DetectorTree Search
Once the tree ordering is determined a detection tree is assembled using real-valued decomposition and approximate sorting strategy to minimize detection latency
Sphere Detector Processing Pipeline
8,:R
(8)y
7,:R (7)y 6,:R (6)y
8i 7i 6i 1i
1,:R (1)y
( )iiT s
2( ) ( 1) ( )1
i i ii i iT T e
s s s
( 1)1
1
ˆTM
ii i ij j
j i
b y R s
s
1 2 8 57 58 64
1 8
1
Detected Vector
Min-Search
i=2
i=1
i=3
i=4
i=5
i=6
i=7
i=8
Page 20
Development / Debugging Environment (Host PC)Deployment Environment
802.16e Sphere DetectorDevelopment/Deployment
LocalLink Interface
Gigabit Ethernet Switch
SysGen HW Co-sim APISysGen HW Co-sim APIMicroprocessorMicroprocessor DSP
ProcessorDSP
Processor
StandaloneC/C++ Program
StandaloneC/C++ Program
MATLAB/SimulinkMATLAB/Simulink
Ethernet Co-simulation Protocol
Virtex-5 FPGA
Virtex-5 EMACLocal-Link Wrapper
SGMII
ClockGenerator
ClockGenerator
MemoryMap
MemoryMap
EthernetCo-simulation
Processor
EthernetCo-simulation
Processor
PHYPHY
I/OPortsI/O
Ports
SharedRegistersShared
Registers
SharedRAMs
SharedRAMs
SysGen DSP Design
(HW CosimDesign)
Antenna layerprocessing, QRD, Sphere detector
SysGen DSP Design
(HW CosimDesign)
Antenna layerprocessing, QRD, Sphere detector
EMAC0
EMAC1
RX FIFO
TX FIFO
RX FIFO
TX FIFO
UserFunction 2
UserFunction 2
Virtex-6 LX240T: 241,152 LCs, 768 MPYs USB, ethernet Hardware cosim support using Xilinx DSP
tools
Page 21
802.16e Sphere DetectorBERT using HW Cosim
System Generator HW Cosim module from previous slide
PN seq generator
Channel Preproc.y
Flex-Sphere Detector
6b QAM mapper
DeM
ux
s~
Gaussian noise gen
H~
Gaussian noise gen
DeM
ux
n~
~
H~PEDmin
sout 64QAM de-mapper
~ Self-synchronizing
BER tester
Eb/Nodata register
EbNo ctrl
Calc scale factor
BER_ready
BER_out
MIMO-OFDM Spatial Multiplexing Detector
Modulation
Page 22
802.16e Sphere DetectorBERT using HW Cosim: Demo UI
0 5 10 15 20 2510-5
10-4
10-3
10-2
10-1
Page 23
802.16e Sphere DetectorFPGA Resource Utilization
V-BLAST channel ordering is key Each sub-carrier for each OFDM symbol
– could probably be relaxed to reduce FPGA area
Main compute complexity is in the antenna ordering 802.16e 5 MHz bandwidth, 360 data sub-carriers, 4x4 antenna
configuration, 64-QAM Real-time operation at 102.9 microsecond frame rate Compute performance
– ~53E9 MPYs/second, ~21E6 matrix inversions/second, 225E6 path metric computations/second
Function Slices LUT/FFT Pairs Block Memory (18k)
DSP Slices
Channel Pre-processor 9,999 20,339/29,954 105 159 RVD QRD 1,715 4,418/5,556 27 30 Sphere Detector 2,445 3,113/6,525 12 48 Total 14,159 27,870/42,035 144 237
Page 24
Conclusion
Key to performance is on determining suitable order in which to process the antennas
Innovative element of this design is on the use of interference cancellation in channel matrix pre-processing and novel real-valued ordering of the sphere detector tree
Might not want to lock this function down in ASIC/ASSP– desire to have flexibility to tailor ML-decoder to air-interface requirements– upgrade with new algorithms
Algorithmic, architecture and micro-architecture optimizations required for good implementation
Architectural transparency provided by System Generator design flow allows for production of FPGA optimized implementation