Today’s article grew out of a series of concrete observations in everyday testing that prompted me to take a closer look at the PCIe slot and the add-in card edge connector mating interface. In my test systems I generally use PCIe 5.0 riser cards to avoid repeated plugging and unplugging at the motherboard, since a slot does not improve with constant use and a new riser card is still cheaper than a new motherboard. Along the way I noticed that some PCIe 5.0 graphics cards ran flawlessly through a riser, while others were only stable when inserted directly into the board. I also encountered a case where a card would not operate error-free even when plugged directly into a heavily used slot; it only ran without error correction activity and without a loss of link speed after I installed it in a brand new motherboard. Notably, in every case the problem, which went as far as a black screen, could be resolved immediately by lowering the transfer rate and forcing the slot to PCIe Gen 4.
When misleading information becomes a standard
It’s a familiar pattern: the PCI SIG defines generous tolerances that are intended to safeguard the mechanics, while the electrical side is left far behind. With the notorious 12VHPWR connector, it was precisely this imbalance that led to loose contacts, increased heating and a wave of RMAs including total failures, and even the improvements that came with 12V2X6 did little to change the excessively wide windows. Truly stable operation therefore remains unnecessarily risky. With PCIe 5.0 the same principle is now repeating itself, because CEM 5.1 allows gap and width ranges on the so-called mating face (“add-in card edge connector mating interface”) of the card that are compliant on paper, but in practice eat up the near-end crosstalk reserve like Aunt Monika’s fat pug eats its treats.
Today I’m looking at two graphics cards, both of which are within the CEM 5.1 tolerances. The first card is conservatively designed, with larger gaps between the edge fingers and correspondingly narrower fingers. The second card is the opposite, with narrower gaps and visibly wider fingers, which changes the electrical boundary conditions on the mating face considerably. I’ll give you three guesses as to which of the two cards fails spectacularly here.


I will now investigate why the first of these two graphics cards works both on a mainboard whose slot has seen frequent use (my test system) and on a brand new replacement board of the same type, even through a PCIe 4.0 riser card (NVIDIA PCAT), while the second card does not. This second card is the one with the wider fingers, a difference that is already visible to the naked eye. Nevertheless, both cards are still just within the standard. Let’s first take a look at the PCI SIG specifications in the current CEM 5.1:

The drawing uses mixed dimensioning, with metric main dimensions and inch values given in square brackets, plus individual directly toleranced dimensions and several values marked as typical. This is only plausible and standard-compliant if the header or note field explicitly states that millimeters are the governing unit and that the values in brackets are pure conversions without independent tolerances. Without such a clarification the presentation would be misleading, because the ± values repeated in brackets could give the impression that both sets of units are equally binding. Dual dimensioning as such is permissible under ASME Y14.5 and ISO 129. From the point of view of production reliability, and given the sensitivity of PCIe 5.0, a more precise presentation would nevertheless be advisable. In this form the drawing complies with the standard, but the mixture of bracketed reference values, typical values and missing explicit limits makes it misleading in places, which in practice could lead to the very design differences that later use up the electrical reserve on the mating face.
Measurements on both cards
In our case, the card with the conservative gap geometry runs without any issues, while the second card with the narrow gap and wider fingers produces retrains, avalanches of error correction and timeouts despite complying with the standard. This article shows why Gen 5 only tolerates minimal geometric leeway at the card edge, why manufacturers need to hit the middle of the window and tighten their processes, and why tolerance should not be confused with robustness: the PCI SIG as a never-ending standardized misunderstanding. The three measurement images that follow show clearly traceable differences in geometry on the mating face. Figure 1 shows the stable card with a minimum gap between two adjacent contact fingers of 502.81 µm, a maximum width of 498.55 µm and a step height at the pad of approximately 50 µm.

Figure 2 documents the problematic card with a gap of 357.74 µm and a maximum gap width of 365.51 µm. The step height is around 40 µm.

The microscope image in Figure 3 also confirms that the problematic card has significantly wider fingers in places, with 630.15 µm measured, whereas the functioning card stays close to 500 µm. This suggests a noticeably larger spread of finger widths within the problematic card, while its gap is consistently at the lower end of the tolerance corridor.

Evaluation of the electrical clamping surface
On the question of the electrical clamping surface and contact reliability, it should be noted that the spring contacts of the connector meet a defined wiping area on the pad. In operation the contact zone is closer to a line and is determined by the spring force, the surface hardness of the hard gold plating and the wiping path, not by the full pad width. As long as the finger width clearly exceeds the footprint of the spring, which is certainly the case at 1.37 to 1.63 mm, widening the finger has practically no effect on the ohmic contact quality. The contact resistance is dominated by the metallic micro-asperities, and the spring generates the necessary local pressure regardless of the remaining pad width. Mechanically, the card with the narrower fingers and larger gaps is therefore completely sufficient, and no electrical advantage is to be expected from even wider fingers with identical springs.
Too small a gap as the cause of possible failures and the consequences
The gap is decisive for the observed failures because it directly determines the capacitive and inductive coupling between neighboring lines. A simplified coupling model is sufficient for the card edge, in which the mutual capacitance scales approximately with 1/s, where s is the gap. Relative to the nominal geometry of 0.40 mm, the problematic card with s = 0.35774 mm gives a factor of 0.40 / 0.35774 = 1.118, i.e. the coupling increases by around 11.8 percent compared to the target. The functioning card with s = 0.50281 mm gives 0.40 / 0.50281 = 0.796, i.e. around 20.4 percent below the target.
This results in a coupling ratio between the two cards of 502.81 / 357.74 = 1.406, which corresponds to an increase in the interfering near-end crosstalk of around 2.96 dB, calculated as 20 · log10(1.406). The slightly lower step height at the pad of the problem card reduces the lateral field coupling minimally, but this effect remains much smaller than the strong narrowing of the gap; the net effect is a relevant increase in near-end crosstalk directly at the receiver. Too much math? Bear with me, there is a little more to come.
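The arithmetic above can be reproduced in a few lines of Python. This is only a sketch of the simplified 1/s model used in the text; the function and variable names are my own, and the gap values are the measurements quoted for Figures 1 and 2.

```python
import math

NOMINAL_GAP_MM = 0.40      # nominal gap used as the reference in the text
GAP_STABLE_MM = 0.50281    # measured gap of the stable card (Figure 1)
GAP_PROBLEM_MM = 0.35774   # measured gap of the problematic card (Figure 2)


def relative_coupling(gap_mm, reference_gap_mm):
    """Coupling relative to a reference gap, assuming coupling ~ 1/s."""
    return reference_gap_mm / gap_mm


def to_db(ratio):
    """Express a voltage-like ratio in decibels (20 * log10)."""
    return 20.0 * math.log10(ratio)


problem_vs_nominal = relative_coupling(GAP_PROBLEM_MM, NOMINAL_GAP_MM)
stable_vs_nominal = relative_coupling(GAP_STABLE_MM, NOMINAL_GAP_MM)
problem_vs_stable = relative_coupling(GAP_PROBLEM_MM, GAP_STABLE_MM)

print(f"problem card vs nominal: {problem_vs_nominal:.3f}")
print(f"stable card vs nominal:  {stable_vs_nominal:.3f}")
print(f"problem vs stable card:  {problem_vs_stable:.3f} "
      f"({to_db(problem_vs_stable):.2f} dB)")
```

Because the model is a plain 1/s scaling, it says nothing about finger length, ground return or the connector itself; it only ranks the two measured geometries against each other.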
Why this has such a strong impact with PCI Express Gen 5 can be explained with a simple frequency analysis. The bit rate is 32 gigatransfers per second, so the Nyquist frequency is 16 gigahertz, and the useful spectral components extend into this range because of the short edges. A typical rise time of around 25 to 30 picoseconds results in a signal bandwidth of roughly 0.35 divided by t_rise, i.e. around 12 to 14 gigahertz. With an effective dielectric constant close to three, the electrical wavelength in the immediate vicinity of the board edge is only a good 10 millimeters. The coupled lengths at the edge of the board, made up of the finger length and the fringe field, add up to around 3 millimeters, which corresponds to a coupling window of around 0.28 relative to the wavelength.
This short coupled segment generates an interference voltage at the near end that is proportional to the coupling constant and the signal amplitude. The additional 3 dB at the edge of the card are therefore immediately noticeable, and the receiver’s equalization has to make up for the loss. And what happens then? The available reserve shrinks, resulting in more frequent retrain sequences, increased error rates on the data link, downgrades to lower link speeds and, finally, graphics driver timeouts and even a blue screen with VIDEO_TDR_FAILURE! This is exactly the behavior I observed with the problematic card. Case closed!
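For anyone who wants to check the frequency estimate, here is a small Python sketch. It makes two assumptions that the text only implies: the effective dielectric constant is taken as exactly 3.0, and the wavelength is evaluated at the Nyquist frequency. All names are my own.

```python
import math

C0_M_PER_S = 299_792_458.0   # speed of light in vacuum


def nyquist_hz(transfers_per_s):
    """Nyquist frequency of an NRZ signal: half the transfer rate."""
    return transfers_per_s / 2.0


def bandwidth_from_rise_time_hz(t_rise_s):
    """Rule-of-thumb bandwidth estimate: 0.35 / rise time."""
    return 0.35 / t_rise_s


def wavelength_mm(frequency_hz, eps_eff):
    """Wavelength in a medium with effective dielectric constant eps_eff."""
    return C0_M_PER_S / (frequency_hz * math.sqrt(eps_eff)) * 1e3


f_nyquist = nyquist_hz(32e9)                      # PCIe 5.0: 32 GT/s
bw_fast = bandwidth_from_rise_time_hz(25e-12)     # 25 ps rise time
bw_slow = bandwidth_from_rise_time_hz(30e-12)     # 30 ps rise time
lam_mm = wavelength_mm(f_nyquist, eps_eff=3.0)    # assumption: eps_eff = 3.0
coupling_window = 3.0 / lam_mm                    # ~3 mm coupled length

print(f"Nyquist frequency: {f_nyquist / 1e9:.1f} GHz")
print(f"Bandwidth: {bw_slow / 1e9:.1f} to {bw_fast / 1e9:.1f} GHz")
print(f"Wavelength: {lam_mm:.1f} mm, coupling window: {coupling_window:.2f}")
```

A slightly different dielectric constant or reference frequency shifts the wavelength by a few tenths of a millimeter, but the coupled segment stays at roughly a quarter of the wavelength either way.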
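A link that has fallen back to a lower speed can be spotted without special tools. The following is a minimal sketch that assumes a Linux system, where the kernel exposes `current_link_speed` and `max_link_speed` for each PCI device under `/sys/bus/pci/devices/`; the helper names and the device address in the usage note are hypothetical. Keep in mind that graphics cards also lower the link speed at idle to save power, so a reading only means something under load.

```python
from pathlib import Path


def parse_gts(text):
    """Return the leading number of a string like '32.0 GT/s PCIe', else None."""
    parts = text.split()
    token = parts[0] if parts else ""
    try:
        return float(token)
    except ValueError:
        return None


def is_downgraded(current_text, max_text):
    """True if the current link speed is below the device's maximum."""
    current = parse_gts(current_text)
    maximum = parse_gts(max_text)
    if current is None or maximum is None:
        return False  # speed unknown, so no statement is possible
    return current < maximum


def check_device(bdf):
    """Read both sysfs attributes for a device address (Linux only)."""
    dev = Path("/sys/bus/pci/devices") / bdf
    current = (dev / "current_link_speed").read_text()
    maximum = (dev / "max_link_speed").read_text()
    return is_downgraded(current, maximum)
```

A call such as `check_device("0000:01:00.0")` (the address is only an example) would return `True` if the card is currently running below its maximum rate. That is also the expected result when the slot has been forced to Gen 4 on purpose.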

Why the tolerance specifications are far too broad
The tolerance specifications of CEM 5.1 are mainly mechanically oriented and ensure plug-in compatibility, but electrically the window for width and gap is wide. The nominal width is 0.60 millimeters with a usual tolerance of approximately ±0.05 millimeters, and the nominal gap of 0.40 millimeters sits in the middle of a plausible corridor of 0.35 to 0.45 millimeters. These specifications alone already allow a coupling variation of around plus 14 percent to minus 11 percent relative to the nominal value, which corresponds to a spread of roughly 2 dB at the near end. On top of that come manufacturing details such as solder mask clearance, the nickel-gold layer structure and edge processing with the chamfer, so that in practice further fractions of a decibel can be lost at the lower edge of the tolerances. The sum is sufficient to push marginal systems into the unstable range.
From a critical point of view, a tighter process specification would make sense for Gen 5, for example a working window of 0.40 millimeters ±0.025 millimeters for the gap, supplemented by strict rules for ground routing directly under the fingers and tightly spaced rows of vias next to the mating face.
Good PCB manufacturers can easily achieve such values with laser direct imaging and controlled galvanic deposition, and the uniformity of the hard gold plating over the nickel barrier can also be improved with suitable equipment. Yes, this involves a certain amount of additional effort in the form of extra process steps, tighter final inspection and slightly more rejects, but it remains moderate in relation to the end product. The savings from a wide tolerance window are measurable in the short term, but in the end they come at a price in terms of RMA costs, reputational damage and avoidable field problems.
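To put numbers on the two windows, the same simplified 1/s coupling model can be applied to the limits of each gap corridor. This is only a sketch with names of my own choosing; the 0.35 to 0.45 millimeter corridor and the ±0.025 millimeter proposal are the values from the text.

```python
import math


def coupling_window(nominal_mm, tol_mm):
    """Relative coupling at both gap limits and the total spread in dB.

    Assumes coupling ~ 1/s, so the smallest gap gives the strongest coupling.
    """
    strongest = nominal_mm / (nominal_mm - tol_mm)   # narrowest allowed gap
    weakest = nominal_mm / (nominal_mm + tol_mm)     # widest allowed gap
    spread_db = 20.0 * math.log10(strongest / weakest)
    return strongest, weakest, spread_db


wide = coupling_window(0.40, 0.05)     # corridor of 0.35 to 0.45 mm
tight = coupling_window(0.40, 0.025)   # proposed 0.40 mm +/- 0.025 mm

for label, (strongest, weakest, spread_db) in (("wide", wide), ("tight", tight)):
    print(f"{label:5s}: {100 * (strongest - 1):+.1f} % to "
          f"{100 * (weakest - 1):+.1f} %, spread {spread_db:.2f} dB")
```

Under this model, halving the gap tolerance also roughly halves the spread in decibels, which is the whole point of the tighter window.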
Summary and conclusion
The card with the narrower fingers and the wider spacing offers a perfectly adequate clamping surface both mechanically and electrically. It uses the spring geometry correctly and avoids unnecessary coupling at the mating face. The card with the wider fingers shows no real electrical advantage with an identical spring; if anything, it penalizes itself through the narrower gap. The high error correction activity observed, including loss of picture, matches the calculated increase in near-end crosstalk of just under 3 dB, and this increase uses up exactly the reserve that the receiver would need for channel variation and ageing. The fact that both cards are formally within the tolerance range and yet behave so differently can be explained by the superposition of several tolerated influences. The functioning card is favorable in several parameters: larger gap, uniform finger width, slightly higher step height at the pad. The problematic card accumulates unfavorable values: small gap, locally very wide fingers, lower step height. Small deviations in the card layout and in the ground connection at the mating face can increase this difference further.
Such a large spread can certainly occur within a single production batch if image alignment, etch compensation and electroplating are not perfectly homogeneous across the panel. In addition, edge processing shifts the effective geometry by a few tens of micrometers, and price-aggressive manufacturers run these processes closer to the tolerance limit in order to avoid rejects. In the short term this saves material and rework, but in the long term it increases the risk of outliers, and with Gen 5 one such outlier is enough for visible display dropouts.
The conclusion is therefore clear. Card 1 in Figure 1 has a conservative, electrically robust mating face with a gap of approximately 0.503 millimeters; the resulting coupling remains lower than with the nominal geometry and the link stability is high. Card 2 in Figure 2 is very close to the minimum gap at around 0.358 millimeters; the coupling increases by around 40 percent (!) compared to the functioning card, which generates almost 3 dB of additional interference at the near end. The result is precisely the link drops and driver timeouts that I (and not only I) have observed. Tight production control with tolerances optimized for electrical function would significantly reduce this risk. It is technically achievable and also makes economic sense given the range of applications for PCI Express Gen 5, and all the more so in view of the prices being charged anyway.
































