Tivadar KISS, Tuan Máté NGUYEN, Ernő DÁVID, Wigner RCP on behalf of the CRU team
and
K. ARNAUD, J.-P. CACHEMICHE, F. HACHON, R. LE GAC, and F. RÉTHORÉ, CPPM
Project milestones:
• Prototype HW (PCIe40) completed (12/15)
• Manufacturability optimizations, 2x SFP+
• 1st FW release (Q4-16)
In focus:
• PCB signal integrity issue: analysis of the issue and possible improvements.
• Verification of the corrected PCBs in multiple ways.
Finding of the Problem
In September 2018, the first cards from the European PCIe40 production (at FEDD)
were tested, and it was the first time PCIe40_v2 had been tested close to its limits.
• Link errors were found under heavy firmware loads
• At that time the ALICE CRU did not have a firmware stressing the links enough
to observe the errors.
The original HW design goals for the PCIe40 card were very ambitious.
Both the LHCb/CPPM and ALICE/CRU teams immediately started to evaluate the problem.
A new PCB design review started at CPPM, while we started to prepare for more exhaustive testing.
Original v2.1 Board Stack-up
The original v2.1 board stack-up, and the assumption that the errors are caused by EM coupling
of noise from the VCC power plane into the VCCT power plane.
(The power plane layout indeed showed a weakness there.)
Possible Improvements Investigated
• CPPM investigated several solutions to reduce this coupling without re-routing the board.
• They consist of inverting feeding planes while respecting manufacturing constraints.
• All have pros and cons. Two solutions were chosen and put into production (2 + 2 pcs.).
• However, the errors remained. A more thorough analysis began...
LHCb Link Errors Test FW
• Reproducing the LHCb tests and results on our 4 boards (all PCB version v2.1)
• With the LHCb test environment we can see the same behavior
Porting the LHCb test FW to the Intel Arria 10 Dev Board,
the errors cannot be seen there, but the noise peak is still present.
Sum of errors in all 48 links as a function of the FPGA occupancy and running frequency.
External optical loopback, PCB version v2.2.2.
Many tests, measurements, and simulations were carried out in parallel to understand the
problem.
• Measuring the external PLL phase noise again
• Adding large capacitors to increase decoupling in the low-frequency domain: no effect
• Testing the power mezzanines on their own with a dynamic load, for noise coupling
between power rails (all within specifications)
Comparing the PCIe40 to a reference hardware (Intel Arria 10 Dev Board)
• FW design ported; no errors on the Dev Board, although the ripple noise peak is still there
• Working domain of the PCIe40 module v2.2.2, explored using the qualification firmware automatically
generated with different occupancies. The running frequency varies in steps of 1 MHz between 1 and 50 MHz
and then in steps of 40 MHz. The FPGA occupancy varies in steps of 5%.
• Serial transmitters and receivers are interconnected by external fibers. The area in green shows where serial
transmission is corrupted. The iso lines show the current drawn from the 0.9 V core supply (VCC).
• The full working domain of the PCIe40 card up to the design limits has been explored (FPGA occupancy
between 10 and 90%, running frequency between 1 and 320 MHz, and the toggle rate fixed at about
50%).
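The sweep described above can be written down as a simple grid enumeration. This is a minimal Python sketch: the function name and the exact placement of the 40 MHz steps above 50 MHz are illustrative assumptions, not taken from the actual qualification firmware.

```python
# Sketch of the qualification sweep grid: 1 MHz frequency steps from
# 1 to 50 MHz, then 40 MHz steps up to the design limit; FPGA occupancy
# from 10% to 90% in 5% steps. Exact high-frequency step points assumed.

def sweep_points(f_max_mhz=320):
    """Yield (frequency_MHz, occupancy_percent) pairs of the sweep."""
    freqs = list(range(1, 51)) + list(range(90, f_max_mhz + 1, 40))
    for f in freqs:
        for occ in range(10, 95, 5):
            yield (f, occ)

points = list(sweep_points())
# 56 frequency settings x 17 occupancy settings = 952 test points
```

Each test point is then run with the fixed ~50% toggle rate while link errors and the 0.9 V supply current are recorded.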
• The PCIe40 module has requirements beyond typical high-speed designs. The module shows a weakness
when the running frequency is below ~35 MHz. It is probably due to a global effect of parasitic capacitance
and inductance from the FPGA package and connector. This effect is reinforced by an excessive
inductance of the mezzanine housing the power supply.
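Such a low-frequency weakness is consistent with a parasitic LC resonance in the power distribution network. A minimal sketch of the governing formula follows; the component values are illustrative assumptions, not measured PCIe40 parasitics.

```python
# A parasitic LC loop in the power distribution network resonates at
# f = 1 / (2*pi*sqrt(L*C)). L and C values below are assumed examples.
import math

def lc_resonance_hz(l_henry, c_farad):
    """Resonance frequency (Hz) of an LC combination."""
    return 1.0 / (2.0 * math.pi * math.sqrt(l_henry * c_farad))

# e.g. ~1 nH of package/connector inductance against ~20 uF of decoupling
f_res = lc_resonance_hz(1e-9, 20e-6)  # ~1.1 MHz, i.e. in the low-frequency domain
```

The point of the sketch is that plausible nH-scale inductances against uF-scale decoupling put the resonance well below the ~35 MHz boundary, in the region where the weakness is observed.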
• It has been shown that the current design works fine for the final firmwares of LHCb and ALICE.*
• All these designs run with a mixture of frequencies (40, 220, 240, 250, 280 MHz), which pushes
the effective running frequency above the critical area. In addition, the 0.9 V current consumption shows
that the toggle rate is below 50%. If some FW designs have to run close to the critical area and exhibit
errors, several mitigation techniques can be used to solve the issue.
• All these studies and the comparative jitter measurements show that, of the two new improved PCB
stack-up versions, v2.2.2 is somewhat better than v2.2.3, and it is also superior to v2.1 (the version
originally planned for production - [T.K.]).
• CPPM recommends resuming the production of the PCIe40 module using PCB version v2.2.2
without further modifications.
• Production plans have been prepared with the SOMACIS and FEDD companies. They are waiting for a green light by the end
of January.
Our own testing with CRU FW v2.5 and dummy user logic
• Unused fabric is filled with dummy logic for switching-noise generation
• Long chain of 250K...1M flip-flops (FPGA occupancy up to 87%)
• Clocking: 280 MHz (fixed)
• Toggling rate can be scaled by the circulated data pattern (0...100%)
• Noise generation can be switched on/off by SW after the PLL and transceivers have been calibrated
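The pattern-scaled toggling can be modeled in a few lines. This is a software sketch of the principle, not the actual FPGA firmware: for a pattern circulated through a flip-flop chain, the per-flip-flop toggle rate equals the fraction of adjacent-bit transitions in the pattern.

```python
# Software model of pattern-scaled toggling: each clock cycle the chain
# shifts by one bit, so a flip-flop toggles whenever two adjacent bits of
# the circulating pattern differ (including the wrap-around pair).

def toggle_rate(pattern: str) -> float:
    """Fraction of cycles on which a flip-flop in the chain toggles."""
    n = len(pattern)
    transitions = sum(pattern[i] != pattern[(i + 1) % n] for i in range(n))
    return transitions / n

assert toggle_rate("01" * 8) == 1.0    # alternating bits: 100% toggling
assert toggle_rate("0" * 16) == 0.0    # constant pattern: no switching
assert toggle_rate("0011" * 4) == 0.5  # 50% toggling
```

Choosing the pattern therefore sets the switching-noise intensity anywhere between 0 and 100% without changing the clock.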
750K FF (72% occupancy), 75% toggling, and 1M FF (87% occupancy), 50% toggling, no DMA
• 2 cards were tested for a few hours
• No errors observed
External loopback of 24 GBT links in WideBus mode with optical patch cables
• The same dummy user logic was used for noise generation
• At the beginning of our systematic testing, we suffered from link instability at high user logic
toggling rates. It took time to understand and fix this issue; the solution was to change our
scheme for distributing the reference clock to the transceiver banks. After that we re-started
the systematic testing of 4 CRU v2.1 boards in mid-December.
• Presently we run our tests at the fixed, ALICE-specific user logic clock rate, i.e. 280 MHz,
with different dummy logic sizes and toggling rates. This is the complement of
the LHCb testing, which uses a fixed toggling rate but a scalable clock frequency.
• We are increasing the load and complexity of the testing step by step, and will shortly run tests with
the TPC cluster finder user logic.
• With the improved CRU FW and at our specific clocking frequency, we now see no
errors in either the GBT links or the PCIe lanes. However, not all tests are completed
yet. Testing will continue in the coming days with demanding load configurations,
and with the dummy logic replaced by the TPC cluster finder logic.
• In general, the effect of the PCB noise issue on possible ALICE implementations looks much
smaller than previously thought, and the possible error rates (if any) do not seem to be critical.
23.01.2019 System Upgrade Forum, CERN 20
Our Summary of the Situation
• Two different improved PCB layouts have been designed and tested. Although they are more
robust in terms of power integrity, these modifications by themselves did not solve the problem.
• Through signal quality measurements, and also by theoretical considerations, the PCB design
alternatives have been qualified, and v2.2.2 has been identified as the optimal solution.
• The board still has a weakness: at certain clocking rates, PDN resonances can occur,
reinforcing the noise on VCC. With a heavy-load FW (high occupancy and large toggling rate),
this may result in link errors in the CRU TX direction (only).
• No obvious design issue has been found by the designers or by external experts. There is no
lead to further improve the PCB short of integrating the power module onto the main board,
which at this stage of the project is practically impossible. Experts hint that
decoupling cannot be optimized for such a wide operating frequency range.
• There are mitigation techniques in FW design that can handle such situations, especially in the
low-frequency domain. Moreover, in the GBT downlink direction we are also protected by the
forward error correction (FEC) mechanism of GBT.
• We basically agree with the analysis of the CPPM team, and also with the conclusion to
resume production with the improved PCB v2.2.2.
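As an illustration of the forward error correction principle mentioned above, here is a minimal Hamming(7,4) single-error-correcting code in Python. Note this is only a conceptual sketch: the actual GBT frame uses a Reed-Solomon code, not Hamming.

```python
# Conceptual FEC sketch (Hamming(7,4), NOT the GBT Reed-Solomon code):
# 3 parity bits protect 4 data bits, so any single flipped bit in the
# 7-bit codeword can be located and corrected at the receiver.

def hamming74_encode(d):
    """Encode 4 data bits into the 7-bit codeword [p1, p2, d1, p3, d2, d3, d4]."""
    d1, d2, d3, d4 = d
    return [d1 ^ d2 ^ d4, d1 ^ d3 ^ d4, d1, d2 ^ d3 ^ d4, d2, d3, d4]

def hamming74_decode(c):
    """Correct up to one flipped bit and return the 4 data bits."""
    c = list(c)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 + 2 * s2 + 4 * s3  # 1-based error position, 0 if clean
    if syndrome:
        c[syndrome - 1] ^= 1
    return [c[2], c[4], c[5], c[6]]

codeword = hamming74_encode([1, 0, 1, 1])
codeword[4] ^= 1                      # flip one bit on the "link"
assert hamming74_decode(codeword) == [1, 0, 1, 1]
```

This is why a single bit error per protected block on the downlink is invisible to the user logic: the decoder locates and repairs it before the payload is used.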
Reserve Slides
• Coupling of the VCC plane in layer 7 towards the VCCR and VCCT planes in layer 6
because of overlapping areas.
• VCCT and VCCR: overlapping areas with highly different current densities
• Many eye diagrams have been measured (~20) while running different tests
• Detailed analysis (eye diagrams vs. error rates) is ongoing. Main points at first sight:
• Otherwise we did not see significant degradation in any of our tests
• Our waveforms have a small drop in the plateau.
• Jitter is slightly larger with the fPLLs than with the ATX PLL
• Our eye opening is similar to what you measured at around a 240 MHz operating frequency.
Summary on the internal (w/ATX PLL) vs. external (w/ fPLL)
reference clock distribution tests
Good hope: external reference clock distribution will solve the additional,
ALICE-specific problem.