EPIC Fixes! Whatsminer Repairs using EPIC Control Boards

  • October 1st, 2024

Background

At Gridless we put our miners through some exceptionally brutal environments, from clouds of dust so heavy that visibility is limited to a few meters to the very high temperatures one expects from an African environment. However, as it turns out, the harshest condition our miners need to survive is the unreliable power that we give them.

Most mining corporations locate their operations where, first and foremost, there is reliable (and cheap!) power. At Gridless we take advantage of unutilized renewable energy at remote, off-grid locations, so we are very cognizant of the unreliable power on these sites. We anticipated this when we wrote the GridlessOS, which monitors available power and controls the miners accordingly.

The GridlessOS actually does this very well, and on sites with very erratic power this means that we are waking the miners up and then putting them back to sleep multiple times per day. This power cycling translates into thermal cycling of the miners and has led to a number of premature failures.

First Fixes

The first units we repaired had issues such as blown voltage supplies or non-reporting temperature sensors that were easy to diagnose. Actually fixing these circuits took a little bit of time as we learned how to deal with aluminum PCBs (which do an excellent job of dissipating heat and so do not lend themselves to being reworked!) and brought in the right tools to do the repairs.

However, this still left us with quite a few hashboards that passed our basic tests but would fail when installed in the final field miner configuration. We spent some time trying to bridge the gap between the results of various basic test platforms and the errors we were seeing in the final configuration. 

For example, we use the PicoBT tester in the initial triage step of our board repair process. We found the PicoBT would report that all the hardware on a hashboard was functioning perfectly, including the ASICs, the EEPROM, and the temperature sensor. However, when we installed this same hashboard into a miner (with the stock Whatsminer control board) it would often throw a 54x or 56x error within a minute or two, if not immediately. The recommended actions for these fault codes are very high level and ended up not being very applicable to our boards. We spent a considerable amount of time experimenting with the Whatsminer control board API, trying to narrow down what the root cause of these issues was, but were unable to make any real progress. However, the small successes we did have only convinced us further that we should be able to hash with these boards; we just couldn’t figure it out with the Whatsminer tools.

EPIC to the rescue

Fortunately, EPIC Blockchain makes a control board that is compatible with the Whatsminer M30x miners that we prefer for our operations. We had already played with the EPIC boards quite a bit to see how the features they offered could help us improve GridlessOS. But now, in light of the mixed results we were seeing from our other testing platforms, we realized those features could help us in a number of other ways.

EPIC Features

The EPIC boards offer a comprehensive suite of hashing controls in a straightforward and visually clear GUI. And the API is clearly laid out for when we’re ready to update GridlessOS with our learnings.

This post won’t go into all the capabilities of the EPIC dashboard, but I do want to highlight some of the features we have come to depend on for our troubleshooting.

  1. Individual ASIC Conditions

One of the first “Aha!” moments we had with the EPIC board came when we realized that the temperature data on our compromised boards was sometimes incomplete. We were able to observe this thanks to the per-chip detail in the EPIC dashboard, which shows each ASIC's hash rate, performance as a percentage, frequency, temperature, and voltage. So even if the PicoBT reports every ASIC on a board as present, with the EPIC we can see that some of those chips may not be mining properly, or at all, or that a chip has a problem with its temperature or voltage. None of these characteristics are monitored on the PicoBT or reported by the Whatsminer tools.

This feature has been invaluable in our diagnostics, helping us identify exactly which ASICs on a hashboard are not performing.
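To make the triage idea concrete, here is a minimal sketch of how per-ASIC readings could be screened once they have been collected. The `AsicReading` fields, the function name, and both thresholds are assumptions made for illustration; they are not the EPIC API or its actual field names.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class AsicReading:
    """One chip's readings. Field names are hypothetical, not EPIC's."""
    index: int
    hashrate_ghs: float
    performance_pct: float
    temperature_c: Optional[float]  # None when no temperature is reported
    voltage_v: float


def flag_suspect_asics(
    readings: List[AsicReading],
    min_performance_pct: float = 80.0,  # assumed threshold
    max_temperature_c: float = 95.0,    # assumed threshold
) -> Dict[int, List[str]]:
    """Map ASIC index -> list of reasons the chip looks unhealthy."""
    suspects: Dict[int, List[str]] = {}
    for r in readings:
        reasons: List[str] = []
        if r.hashrate_ghs <= 0:
            reasons.append("not hashing")
        elif r.performance_pct < min_performance_pct:
            reasons.append("low performance")
        if r.temperature_c is None:
            reasons.append("no temperature data")
        elif r.temperature_c > max_temperature_c:
            reasons.append("over temperature")
        if reasons:
            suspects[r.index] = reasons
    return suspects
```

A board that passes a presence-only check can still return a non-empty result here, which is the gap between the basic tester and the per-chip view described above.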

  2. Frequency & Voltage Adjustment

The stock Whatsminer control boards allow us to adjust hashboard frequency as a percentage of nominal. (But one has to do the work to figure out what nominal is, since it’s not stated.) The control options also allow us to adjust overall peak power. The control structure seems to be to operate at the requested power and adjust the frequency up until the desired set point is reached. But truthfully we haven’t been able to pinpoint exactly how these are correlated in the firmware.

The EPIC control board allows for direct frequency adjustment; i.e. if you want 480 MHz you enter 480 MHz! Depending on the power supply unit, it also allows us to vary the board voltage anywhere from 10 V to 13 V in 10 mV steps. As expected, increasing either of these variables results in higher power consumption by the hashboard.

So while we were never able to find a consistent combination of settings that worked using the stock Whatsminer control boards, with the EPIC control boards we were easily able both to find a combination of frequency and voltage settings that worked for our compromised boards AND to develop an algorithm that repeatedly finds this working combo for a given set of hashboards.
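The post does not spell out that search algorithm, so the following is only a minimal sketch of one possible approach, not the procedure Gridless actually uses. It walks candidate frequencies from highest to lowest and, for each, steps the board voltage up through the 10 V to 13 V range until the boards run stably. The `is_stable` callback is hypothetical: it stands in for whatever applies the settings and watches the boards for errors.

```python
from typing import Callable, Iterable, Optional, Tuple


def find_working_setting(
    is_stable: Callable[[int, int], bool],  # hypothetical: (MHz, mV) -> stable?
    frequencies_mhz: Iterable[int],
    min_mv: int = 10_000,   # 10 V lower bound, from the text
    max_mv: int = 13_000,   # 13 V upper bound, from the text
    step_mv: int = 10,      # 10 mV resolution, from the text
) -> Optional[Tuple[int, int]]:
    """Return the first (frequency, voltage) pair that runs stably.

    Frequencies are tried from highest to lowest; for each one the voltage
    is raised in step_mv increments, so the result is the lowest stable
    voltage at the highest workable frequency. Returns None if nothing works.
    """
    for freq in sorted(frequencies_mhz, reverse=True):
        # Integer millivolts avoid floating-point drift across many steps.
        for mv in range(min_mv, max_mv + 1, step_mv):
            if is_stable(freq, mv):
                return freq, mv
    return None
```

On real hardware each trial takes time, so a coarser `step_mv` for a first pass followed by a finer pass around the hit would likely be more practical than sweeping all 301 voltage steps per frequency.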

As I was writing this post I did a quick comparison to highlight the difference. By adjusting frequency and Power Limit on the stock Whatsminer control board, the best I was able to achieve was 11.5 TH/s with an efficiency of 82 W/TH. (A VERY poor efficiency!)

With the EPIC controller on the same three boards I was able to find a frequency and voltage setting that hashed at 26 TH/s with an efficiency of 45.5 W/TH.
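For readers who want the arithmetic behind those two results, here is a short worked calculation. It uses only the four figures quoted above and the definition of efficiency as power divided by hash rate; the variable and function names are mine.

```python
def implied_power_w(hashrate_ths: float, efficiency_w_per_th: float) -> float:
    """Power draw implied by a hash rate (TH/s) and an efficiency (W/TH)."""
    return hashrate_ths * efficiency_w_per_th


stock_power_w = implied_power_w(11.5, 82.0)  # 943.0 W
epic_power_w = implied_power_w(26.0, 45.5)   # 1183.0 W

hashrate_gain = 26.0 / 11.5                  # about 2.26x the hash rate
power_increase = epic_power_w / stock_power_w - 1.0  # about 25% more power
energy_per_th_saving = 1.0 - 45.5 / 82.0     # about 44.5% less energy per TH
```

In other words, the same three boards produced roughly 2.26 times the hash rate for about a quarter more power.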

This is a huge improvement and the process translates clearly into steps for GridlessOS to take when it reads an error from a miner.

As a bonus, EPIC allows us to control individual ASIC frequency! This is a huge feature which we haven’t even begun to explore, but it should allow us to tune our miners to make full use of all the power available on site while also accommodating specific ASICs that are not up to spec.

  3. Resets

Many of the adjustments on the Whatsminer control board result in the whole miner restarting. For many adjustments the EPIC board is able to implement the new settings without restarting the board. This has a number of advantages: a board that is restarting is not hashing, and a restarting hashboard cools down a bit before it begins hashing again, which is another thermal cycle.

And, crucially for Gridless, we can now issue new settings to the entire fleet without causing large power fluctuations. Currently the GridlessOS has to change the settings of individual miners one at a time since any mass change will cause a large power fluctuation on the weak grids we are connected to as all the miners restart at the same time. With the EPIC controls we can now simplify things and issue a new setting to the fleet without fear of tripping our power source. 
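The reasoning here can be put into rough numbers. The sketch below estimates the largest simultaneous load drop a settings rollout could cause, assuming a restarting miner briefly draws no hashing power. The function name and the per-miner wattages are hypothetical, and the model deliberately ignores the smaller change in draw that the new settings themselves produce.

```python
from typing import List


def worst_case_power_dip_w(
    miner_power_w: List[float],
    change_restarts_miner: bool,
    batch_size: int,
) -> float:
    """Largest load that can drop off the supply at one moment.

    Assumes a restarting miner draws no hashing power for a short while,
    and that up to batch_size miners receive the change at the same time.
    """
    if not change_restarts_miner:
        return 0.0  # settings applied live: no miner drops offline
    # Worst case is the batch made up of the hungriest miners.
    return sum(sorted(miner_power_w, reverse=True)[:batch_size])


fleet = [3300.0] * 10  # hypothetical fleet: ten miners at 3.3 kW each

all_at_once = worst_case_power_dip_w(fleet, True, batch_size=len(fleet))
one_at_a_time = worst_case_power_dip_w(fleet, True, batch_size=1)
no_restart = worst_case_power_dip_w(fleet, False, batch_size=len(fleet))
```

With restarts, the only way to keep the dip small is a small batch, which is why changes currently go out one miner at a time. Without restarts the batch size stops mattering.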

  4. Just keep on hashing

Finally, and possibly most beneficially, the EPIC boards are able to work around errors cropping up in our miners. If individual ASICs aren’t hashing (as described above) or a hashboard develops an issue, the EPIC firmware has the option to ignore that ASIC or hashboard and continue hashing with the remaining resources. If the stock control board receives an error, it will almost certainly shut down the entire miner, leading to a fairly significant loss of overall performance and income on our side.

These features (as well as others such as fan control and perpetual tuning) mean that the EPIC control board is not only the best platform for us to diagnose our faulty boards but also the go-to solution for our deployments.

Moving forward

We are using these learnings to update our diagnostic procedures and improve our algorithms to not only make use of compromised hash boards but also add some preventative measures to GridlessOS. We expect that with the EPIC firmware we will also be able to mitigate some of the root causes of our issues. Additionally, the EPIC team have been extremely responsive in providing detailed suggestions and feedback. We are even working with them to develop a small test control firmware to investigate further some of the behavior we have observed.

We look forward to finally getting all these hashboards back out into the field and working for us!
