
CRU HW update, HW Improvement Status

ALICE System Upgrade Forum, CERN, 23 Jan, 2019

Tivadar KISS, Tuan Máté NGUYEN, Ernő DÁVID, Wigner RCP on behalf of the CRU team
and
K. ARNAUD, J.-P. CACHEMICHE, F. HACHON, R. LE GAC, and F. RÉTHORÉ, CPPM



CRU Hardware: PCIe40 v2 of LHCb

Design Consolidation & Improvements

• USB Blaster II
• New Power Subsystem
• Improved Cooling
• New Clock Tree
• Faster Arria 10 FPGA (10AX115S3F45I2SG)
• Manufacturability optimizations
• 2x SFP+



CRU Project Milestones (timeline 2015-2020):

• HW DR: prototype HW (PCIe40) completed (12/15)
• CRU EDR (20/6/16)
• 1st FW release (Q4-16)
• First PCIe40 CRU for ALICE (1/17)
• Pre-series production, PRR (14/5/18)
• CRU common firmware release v3.9 (Aug-18)
• Production and test finished (Q1-Q2-19)
• LS2

In focus:
• PCB signal integrity issue: analysis of the issue and possible improvements.
• Verification of the corrected PCBs in multiple ways.

Finding of the Problem

In September 2018, the first cards from the European PCIe40 production (at FEDD)
were tested, and it was the first time PCIe40_v2 had been tested close to its limits.
• Link errors were found at heavy firmware loads.
• At that time, the ALICE CRU did not yet have a firmware stressing the links enough
to observe the errors.

The original HW design goals for the PCIe40 card were very ambitious:

• FPGA occupancy: 90%
• Running frequency of the logic: 40 MHz, 240 MHz
• Toggle rate: 50%
• 0.9 V FPGA core voltage consumption: 45 A
• Target range for power noise decoupling: 1 kHz - 600 MHz

Both the LHCb/CPPM and ALICE/CRU teams immediately started to evaluate the problem.
A new PCB design review started at CPPM, while we started to prepare for more exhaustive testing.

Original v2.1 Board Stack-up

The original v2.1 board stack-up and the assumption that the errors are caused by EM coupling
of noise from the VCC power plane to the VCC_T power plane.
(The power plane layout design indeed showed a weakness there.)

Possible Improvements Investigated

• CPPM investigated several solutions to reduce this coupling without re-routing the board.
• They consist of inverting feeding planes while respecting manufacturing constraints.
• All have pros and cons. Two solutions were chosen and put into production (2 + 2 pcs.).
• However, the errors remained. A more thorough analysis began...
LHCb Link Errors Test FW

The dependency of the error rate on the clocking frequency of the logic soon came into focus.



Errors at High and Low Clock Frequencies



Repeating the LHCb Tests in ALICE

• Reproducing the LHCb tests and results on our 4 boards (all PCB version v2.1)
• With the LHCb test environment we see the same behavior



VCC Noise Peak at Low Operating Frequency (< 35 MHz)

Porting the LHCb test FW to the Intel Arria 10 Dev Board, the errors cannot be seen there,
but the noise peak is still present.

Errors as a function of occupancy

• A script was made to compile 15 firmwares with occupancies going from 15% to 90% in steps of 5% (a sketch follows this list)
• Errors were measured with the logic operated around 25 MHz in each case
• These tests were done on PCB v2.2.x
• With a modified (randomized) test FW
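
A minimal sketch of what such a sweep script could look like, assuming the dummy-logic size is exposed to the firmware as a Verilog macro and the project builds with the standard Intel Quartus command-line flow; the project name and macro name are hypothetical:

    import subprocess
    from pathlib import Path

    PROJECT = "cru_test"            # hypothetical Quartus project/revision name

    for occ in range(15, 91, 5):    # target FPGA occupancy in percent
        # Pass the target occupancy to the firmware as a Verilog macro by
        # appending it to the project settings file (assumes the dummy
        # flip-flop chain length is derived from OCCUPANCY_PCT in the FW).
        qsf = Path(f"{PROJECT}.qsf")
        base = qsf.read_text().split("# --- sweep ---")[0]
        qsf.write_text(base + "# --- sweep ---\n"
                       f'set_global_assignment -name VERILOG_MACRO "OCCUPANCY_PCT={occ}"\n')

        # Run the full Quartus compile flow for this occupancy point.
        subprocess.run(["quartus_sh", "--flow", "compile", PROJECT], check=True)

        # Keep the bitstream for the ~25 MHz link-error measurement.
        Path(f"output_files/{PROJECT}.sof").rename(f"{PROJECT}_occ{occ}.sof")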



Link errors vs. running frequency and FPGA occupancy

Sum of errors in all 48 links as a function of the FPGA occupancy and running frequency.
External optical loopback, PCB version v2.2.2.



Investigations in Multiple Directions...

Many tests, measurements and simulations were carried out in parallel to understand the problem.
• Measuring the external PLL phase noise again
• Adding large capacitors to increase decoupling in the low frequency domain: no effect
• Testing the power mezzanines by themselves with a dynamic load for noise coupling between power rails (all within specifications)

Comparing the PCIe40 to a reference hardware (Intel Arria 10 Dev Board)
• FW design ported; no errors on the Dev Board, although the ripple noise peak is there

Several simulations with high-end tools (Ansoft, Sigrity)
• Simulating decoupling, plane impedance and resonances by CPPM and ALICE experts

Consulting with Intel
• An FPGA-internal issue is not likely

Study (review) by an external expert on PCB design
• No design errors found. Excessive inductance of the power mezzanine connectors may contribute to the problem. Proposes choosing PCB v2.2.2 over v2.2.3. Other observations and suggestions are included in the "LHCb Conclusion and Proposal" slide below.



Exploring the entire domain

• Working domain for the PCIe40 module v2.2.2. It is explored using the qualification firmware automatically generated with different occupancies. The running frequency varies in steps of 1 MHz between 1 and 50 MHz and then in steps of 40 MHz. The FPGA occupancy varies in steps of 5%. (An enumeration of this grid is sketched below.)

• Serial transmitters and receivers are interconnected by external fibers. The area in green shows where serial transmission is corrupted. The iso lines show the current for the 0.9 V core supply voltage (VCC).
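
For illustration, the test grid described above can be enumerated as follows; the 320 MHz upper bound and the 10-90% occupancy range are taken from the conclusions slide, while the exact frequency points above 50 MHz are an assumption:

    # Frequency axis: 1 MHz steps from 1 to 50 MHz, then 40 MHz steps up to 320 MHz.
    freqs_mhz = list(range(1, 51)) + list(range(80, 321, 40))   # assumed points above 50 MHz
    occupancies_pct = list(range(10, 91, 5))                    # 5% steps, 10..90%

    grid = [(f, occ) for f in freqs_mhz for occ in occupancies_pct]
    print(len(grid), "test points")   # 57 frequencies x 17 occupancies = 969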



Comparison of the improved PCB designs
v2.2.2 vs. v2.2.3



LHCb Conclusion and Proposal

• The full working domain of the PCIe40 card up to the design limits has been explored. (FPGA occupancy between 10 and 90%, running frequency between 1 and 320 MHz, and the toggle rate fixed to about 50%.)

• The PCIe40 module has requirements beyond typical high-speed designs. The module shows a weakness when the running frequency is below ~35 MHz. It is probably due to a global effect of parasitic capacitance and inductance from the FPGA package and connector. This effect is reinforced by an excessive inductance of the mezzanine housing the power supply.

• It has been shown that the current design works fine for the final firmwares of LHCb and ALICE.*

• All these designs run with a mixture of frequencies (40, 220, 240, 250, 280 MHz), which pushes the effective running frequency above the critical area. In addition, the 0.9 V current consumption shows that the toggle rate is below 50% (a back-of-the-envelope sketch follows this list). If some FW designs have to run close to the critical area and exhibit errors, several mitigation techniques can be used to solve the issue.

• All these studies and the comparative jitter measurements show that of the two new, improved PCB stack-up versions, v2.2.2 is somewhat better than v2.2.3, and it is also superior to v2.1 (planned to be produced originally - [T.K.]).

• CPPM recommends resuming the production of the PCIe40 module using PCB version v2.2.2 without further modifications.

• Plans have been prepared with the SOMACIS and FEDD companies. They are waiting for a green light by the end of January.
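
As a back-of-the-envelope sketch of that current-based inference (the 45 A / 50% reference point comes from the design-goal table; the measured current is a made-up example, and static current and frequency differences are ignored):

    # Dynamic CMOS core current scales roughly as I ~ alpha * C_eff * V * f,
    # so at a fixed voltage and effective frequency the toggle rate alpha can
    # be estimated by linear scaling from a known reference point.
    I_REF_A, ALPHA_REF = 45.0, 0.50   # design goal: 45 A on the 0.9 V rail at 50% toggling
    I_MEASURED_A = 30.0               # hypothetical measured 0.9 V current draw

    alpha = ALPHA_REF * I_MEASURED_A / I_REF_A   # same V and f assumed
    print(f"implied toggle rate: {alpha:.0%}")   # ~33%, i.e. below 50%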



GBT links - internal loopback tests

Our own testing with CRU FW v2.5 and dummy user logic:
• Unused fabric is filled with dummy logic for switching-noise generation
• Long chain of 250K...1M flip-flops (FPGA occupancy up to 87%)
• Clocking: 280 MHz (fixed)
• Toggle rate can be scaled by the circulated data pattern (0...100%); a sketch follows this list
• Noise generation can be switched on/off by SW after the PLLs and transceivers have been calibrated
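
The toggle rate of such a flip-flop chain is set by the transition density of the circulated bit pattern. A minimal sketch of how a pattern with a target toggle rate could be generated; this only illustrates the principle and is not the actual CRU test code:

    import random

    def toggle_pattern(length: int, toggle_rate: float) -> list[int]:
        # Each bit flips relative to the previous one with probability
        # toggle_rate, so the average transition density of the circulated
        # pattern (and hence the flip-flop toggle rate) matches the target:
        # 0.0 -> static pattern, 1.0 -> strictly alternating bits.
        bits = [0]
        for _ in range(length - 1):
            bits.append(bits[-1] ^ int(random.random() < toggle_rate))
        return bits

    pattern = toggle_pattern(1024, 0.75)   # e.g. the 75% toggling test point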

Internal loopback of 24 GBT links in WideBus mode inside the FPGA:
• Loopback data in all 24 GBT links; the FPGA counts the errors (4480 Mb/s data transfer rate)
• First, optical components were not involved
• The goal was to verify the FW design changes
• No DMA transfer in these first tests

750K FF (72% occupancy), 50% toggling, no DMA:
• 4 cards were tested for 10..16 hours
• No errors observed

750K FF (72% occupancy), 75% toggling, and 1M FF (87% occupancy), 50% toggling, no DMA:
• 2 cards were tested for a few hours
• No errors observed
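
Even error-free runs bound the link quality: by the rule of three, zero observed errors in n transmitted bits limit the BER to roughly 3/n at 95% confidence. A rough estimate for the 16-hour, 4-card run above, assuming the quoted 4480 Mb/s applies per link:

    # Rule of three: zero errors in n bits  =>  BER < 3/n at ~95% confidence.
    rate_bps = 4.48e9           # GBT WideBus user data rate per link (from the slide)
    seconds = 16 * 3600         # shortest quoted test duration
    links, cards = 24, 4

    n_bits = rate_bps * seconds * links * cards
    print(f"BER < {3.0 / n_bits:.1e} at 95% CL")   # ~1.2e-16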



GBT links - external loopback tests

External loopback of 24 GBT links in WideBus mode with optical patch cables:
• The same dummy user logic was used for noise generation

750K FF (72% occupancy), 50% toggling, no DMA:
• 4 cards were tested for 16..23 hours
• No errors detected

750K FF (72% occupancy), 75% toggling, no DMA:
• 2 cards were tested for 2 hours
• No errors detected

1M FF (87% occupancy), 50% toggling, no DMA:
• 1 card was tested for 11.4 hours
• No errors detected

1M FF (87% occupancy), 50% toggling, DMA:
• 1 card was tested for 17.8 hours
• No errors detected

Typical FPGA temperatures (in "Server3", closed lid, max airflow):
• Dummy logic not toggling: ~40 °C
• 750K FF, 75% toggling, no DMA: 70..73 °C
• 1M FF, 50% toggling, no DMA: 70..72 °C

To be done:
1. Testing more cards with 1M FF, 50% toggling + DMA



GBT link tests with stressed connection

1 link, VLDB loopback, GBT mode:
• GBTx counts errors (corrected by FEC) in the downlink
• FPGA counts errors in the uplink

DOWNLINK: 200 m cable + 0 dB attenuation

750K FF (72% occupancy), 50% toggling, no DMA:
• 4 cards were tested for 8..17 hours
• No errors detected

750K FF (72% occupancy), 50% toggling, + DMA:
• 1 card was tested for 7.5 hours
• No errors detected

1M FF, 50% toggling, + DMA:
• 1 card was tested for 18.5 hours
• No errors detected

To be done:
1. Testing more links and cards with 1M FF, 50% toggling + DMA
2. Testing with added attenuation (10 dB... +) to see the budget
Testing with TPC User Logic

Testing with real user logic is also a must:
• TPC Cluster Finder logic is used for switching-noise generation
• Presently it has the same occupancy (~87%) as the 1M flip-flop dummy logic
• Pseudo-random data pattern

Status: almost done...
• FW integration done
• SYNC patterns detected
• No significant temperature rise
• No data transferred to the server yet

Debugging is going on; hopefully we can run the first tests this week.

To be done:
1. Testing all 4 cards with the 24-link external (optical) loopback test with toggling user logic
2. Testing VLDB - CRU links (a few) & possibly TPC FEC - CRU links with toggling CF user logic
3. Testing junction temperature


Summary of our Test Results

• At the beginning of our systematic testing, we suffered from link instability at high user-logic toggling rates. It took time to understand and fix this issue; the solution was to change our scheme of distributing the reference clock to the transceiver banks. After that we restarted the systematic testing of the 4 CRU v2.1 boards in mid December.

• Presently we run our tests at the fixed, ALICE-specific user logic clock rate, i.e. 280 MHz, with different dummy logic sizes and toggling rates. This is the complement of the LHCb testing, which uses a fixed toggling rate but a scalable clock frequency.

• We are increasing the load and complexity of the testing step by step and will shortly run tests with the TPC Cluster Finder user logic.

• With the improved CRU FW and at our specific clocking frequency, we now see no errors in either the GBT links or the PCIe lanes. However, not all tests are completed yet. Testing will continue in the coming days with more demanding load configurations and with the dummy logic replaced by the TPC Cluster Finder logic.

• In general, the effect of the PCB noise issue on possible ALICE implementations looks much smaller than thought earlier, and the possible error rates (if any) do not seem to be critical.
Our Summary of the Situation

• Two different improved PCB layouts have been designed and tested. Although they are more robust in terms of power integrity, these modifications themselves did not solve the problem.

• By signal quality measurements, and also by theoretical considerations, the PCB design alternatives have been qualified, and v2.2.2 has been identified as the optimum solution.

• The board still has a weakness: at certain clocking rates, PDN resonances can occur, reinforcing the noise on VCC (for intuition, see the sketch after this list). With a heavy-load FW (high occupancy and large toggling rate) this may result in link errors in the CRU TX direction (only).

• No obvious design issue was found by the designers or the external experts. There is no lead to further improve the PCB, short of integrating the power module onto the main board, which at this stage of the project is practically impossible. Experts give the hint that decoupling cannot be optimized for such a wide operating frequency range.

• There are mitigation techniques in FW design that can handle such situations, especially in the low frequency domain. Moreover, in the GBT downlink direction we are also protected by the forward error correction (FEC) mechanism of GBT.

• We basically agree with the analysis of the CPPM team and also with the conclusion to resume production with the improved PCB v2.2.2.
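
For intuition on the PDN resonance mechanism mentioned above: a series resonance between a connector inductance and the decoupling capacitance peaks at f = 1/(2*pi*sqrt(LC)). The values below are illustrative placeholders, not measured PCIe40 parameters, but they show how such a peak can land in the problematic band below ~35 MHz:

    import math

    L_H = 5e-9     # assumed effective inductance of the power mezzanine connector
    C_F = 100e-9   # assumed effective decoupling capacitance seen on VCC

    f_res = 1.0 / (2.0 * math.pi * math.sqrt(L_H * C_F))
    print(f"resonance at ~{f_res / 1e6:.1f} MHz")   # ~7 MHz, below the ~35 MHz limit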
Reserve Slides



SI Issue Found by CPPM

• Now we can also reproduce the erroneous behaviour
• The FW is now clean enough to say that the errors are HW ones, not FW design issues


Explanation

• Coupling of the VCC plane in layer 7 towards the VCCR and VCCT planes in layer 6 because of overlapping areas.
• VCCT and VCCR: overlapping areas with highly different current densities





Power Planes Noise Spectrum



Simulations


Error Fluctuation (without power cycle)
CRU FW v2.5 + 250K FFs @ 240 MHz, 50% toggling

[Plot: error counts across repeated runs, annotated "JTAG programming + Reboot + Recalibration", "Reboot + Recalibration", and "Recalibration"]
Eye Diagrams

• A lot of eye diagrams (~20) have been measured while running different tests
• Detailed analysis (eye openings vs. error rates) is going on. Main points at first sight:
• Our waveforms have a little drop in the plateau.
• Jitter is a little bigger with the fPLLs than with the ATX PLL.
• Our eye openings are similar to what you measured at around 240 MHz operating frequency.
• Otherwise we did not see significant degradation in any of our tests.
Summary on the internal (w/ ATX PLL) vs. external (w/ fPLL) reference clock distribution tests

There is good hope that the external reference clock distribution will solve the additional, ALICE-specific problem.
