Question Diagnosing infrequent random freezes

Linguofreak

Well-known member
Joined
May 10, 2008
Messages
5,034
Reaction score
1,273
Points
188
Location
Dallas, TX
I have a ~3 year old homebuilt desktop, running Ubuntu 16.04. I had previously been running 14.04. For the past few months (starting before the upgrade) I've been experiencing random freezes when the machine is left idle for a few days (I keep it running constantly). It never seems to happen when I am actually using the machine (though given the relative infrequency of the freezes, that may be purely statistical). I am quite consistently getting *only* freezes (never any kernel panics), and there is never anything in the logs that indicates any trouble (or even any consistent chain of events) leading up to a freeze: There's just a bunch of business as usual log messages, and then from a certain time the log messages stop. I had been suspecting the freezes might have something to do with the monitor being put into powersave mode (some sort of graphics driver bug, perhaps), but after setting Xscreensaver to never put the monitor to sleep, I awoke this morning to find the machine frozen in the middle of a screensaver.

I'm now trying to boot without the custom kernel that I normally use, in the hope that I'm hitting a kernel bug that isn't in the stock Ubuntu kernel.

The freeze does not appear to be anything CPU or GPU related. I've exercised both fairly well while using the machine without trouble, the screensaver the machine froze in this morning is not a computationally intensive one, and the freeze this morning is the first one not to have occurred while the monitor was in powersave mode (meaning no intensive screensavers or interactive programs). The freeze also does not appear to be RAM related. (I'd expect RAM issues to be more varied: a kernel panic here, an application crash there, and a freeze thrown in every once in a while). It does not seem to be HDD related (I've seen failed HDDs result in lockups before, but the machine doesn't tend to boot afterward, and if it does there tends to be tons of log spam about disk errors).

Once again, I'm hoping this relates to my custom kernel somehow and is entirely a software issue. If it is not, is there any item of hardware that people would tend to suspect?
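
Since the only symptom in the logs is that the messages stop, one way to pin down when each freeze happened is to scan the log for long silences. The following is a minimal sketch, not a definitive tool: it assumes the traditional syslog timestamp format (`Mon DD HH:MM:SS`, with no year) at the start of each line, and the function names `parse_syslog_time` and `find_gaps` and the 30-minute threshold are made up for illustration.

```python
from datetime import datetime, timedelta


def parse_syslog_time(line, year):
    """Parse the leading 'Mon DD HH:MM:SS' stamp of a traditional syslog line.

    Traditional syslog stamps carry no year, so the caller supplies one.
    Returns None for lines that do not start with a valid stamp.
    """
    stamp = line[:15]
    try:
        return datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")
    except ValueError:
        return None


def find_gaps(lines, year, min_gap=timedelta(minutes=30)):
    """Return (last_seen, next_seen) pairs where logging went silent.

    A freeze followed by a reboot shows up as a long gap between the last
    message before the lockup and the first message of the next boot.
    """
    gaps = []
    prev = None
    for line in lines:
        t = parse_syslog_time(line, year)
        if t is None:
            continue  # skip continuation lines or other formats
        if prev is not None and t - prev >= min_gap:
            gaps.append((prev, t))
        prev = t
    return gaps


# Example use (path and year are assumptions for this machine):
# with open("/var/log/syslog", errors="replace") as f:
#     for last_seen, next_seen in find_gaps(f, 2016):
#         print("silent from", last_seen, "until", next_seen)
```

The first element of each pair is the last moment the machine was known to be alive, which can then be compared against UPS events, cron jobs, or anything else that runs on a schedule.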
 

dbeachy1

O-F Administrator
Administrator
Orbiter Contributor
Addon Developer
Donator
Beta Tester
Joined
Jan 14, 2008
Messages
9,217
Reaction score
1,563
Points
203
Location
VA
Website
alteaaerospace.com
Preferred Pronouns
he/him
In my experience, a hard lock-up at idle is almost always caused by a hardware problem -- every time I have had that happen over the years (except when it was caused by overclocking) it always ended up being a hardware issue. :(

I assume you aren't overclocking, correct? If you are, though, the first thing I would do is set the clock back to stock speed. If you are NOT overclocking, as a test you could also try underclocking your CPU by 200 MHz or so to see if that resolves the issue.

The next thing I would check is that your RAM speeds are not overclocked. For example, if your RAM is rated at DDR-1600, then make sure it's set in the BIOS to run at DDR-1600 or DDR-1333. If you are running 4 DIMMs, make sure the command rate is set to "2T" and not "1T" in the BIOS. Sometimes even with 2 DIMMs I have had to set it to 2T before the system was stable. You can also try, as a test, bumping your RAM speed down one step (e.g., if your RAM is running at DDR-1600, set it to DDR-1333) to see if the RAM is marginal.

Some other things I have seen cause hard lockups are:
* Failing video card or other PCI Express card (which causes the bus to crash == hard lockup).
* Failing power supply
* Power brownouts (which could be helped by a UPS)
* Failing motherboard (usually the capacitors start failing after some years)
* Failing hard drive (only saw this happen once, though).
* Failing CPU (although this seems less likely since you never see kernel panics)

Unfortunately, troubleshooting this can be difficult. You could try removing all optional expansion cards and unplugging all USB devices except keyboard and mouse in order to narrow down the hardware connected.
 

Urwumpe

Not funny anymore
Addon Developer
Donator
Joined
Feb 6, 2008
Messages
37,616
Reaction score
2,336
Points
203
Location
Wolfsburg
Preferred Pronouns
Sire
Very often, the Windows Logs can give you a hint as to which hardware component is causing trouble. But this is not reliable: if you have, for example, memory or mainboard problems, the event log is full of contradictory messages.

An especially good cause for troubles is also an irregular power supply.
 

Artlav

Aperiodic traveller
Addon Developer
Beta Tester
Joined
Jan 7, 2008
Messages
5,790
Reaction score
780
Points
203
Location
Earth
Website
orbides.org
Preferred Pronouns
she/her
This sounds familiar, and in my case was caused by dried out caps in the power supply. So if you have a spare one, that's a good place to start.
In general, this sounds like a hardware issue, and a good approach would be to go through components and remove/replace them one by one, after doing what dbeachy1 said.
 

Xyon

Puts the Fun in Dysfunctional
Administrator
Moderator
Orbiter Contributor
Addon Developer
Webmaster
GFX Staff
Beta Tester
Joined
Aug 9, 2009
Messages
6,926
Reaction score
794
Points
203
Location
10.0.0.1
Website
www.orbiter-radio.co.uk
Preferred Pronouns
she/her
Linguofreak said: "I have a ~3 year old homebuilt desktop, running Ubuntu 16.04. I had previously been running 14.04."

Urwumpe said: "Very often, the Windows Logs can give you a hint"

I don't think the Windows log is going to help here.

More helpfully, I have experienced similar pain, due to my disk being set into AHCI mode and the operating system not liking the driver for that - but those freezes also happened while using it. I would, since you're running Linux, look into what CPU governor and I/O scheduler you're running, and try to see in dmesg or the syslog if the CPU or disk is being sent to sleep because it has no work to do in a bid to save power.
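
As a rough sketch of how to check those settings, the script below reads the active CPU frequency governor and I/O scheduler from sysfs. The paths `devices/system/cpu/cpu*/cpufreq/scaling_governor` and `block/*/queue/scheduler` are the standard Linux sysfs locations; the function names and the `sys_root` parameter are made up here so the code can be pointed at a test directory, and machines without cpufreq support will simply report no governors.

```python
import glob
import os
import re


def active_scheduler(text):
    """Extract the active scheduler from e.g. 'noop deadline [cfq]'.

    The kernel marks the active choice with square brackets; if there are
    none, the stripped text is returned unchanged.
    """
    match = re.search(r"\[([^\]]+)\]", text)
    return match.group(1) if match else text.strip()


def read_power_settings(sys_root="/sys"):
    """Return the CPU governor per core and the I/O scheduler per block device."""
    settings = {"cpu_governor": {}, "io_scheduler": {}}

    gov_pattern = os.path.join(
        sys_root, "devices", "system", "cpu", "cpu[0-9]*", "cpufreq", "scaling_governor"
    )
    for path in sorted(glob.glob(gov_pattern)):
        cpu = path.split(os.sep)[-3]  # e.g. 'cpu0'
        with open(path) as f:
            settings["cpu_governor"][cpu] = f.read().strip()

    sched_pattern = os.path.join(sys_root, "block", "*", "queue", "scheduler")
    for path in sorted(glob.glob(sched_pattern)):
        device = path.split(os.sep)[-3]  # e.g. 'sda'
        with open(path) as f:
            settings["io_scheduler"][device] = active_scheduler(f.read())

    return settings


# Example use on a live system:
# print(read_power_settings())
```

If the governor turns out to be one that clocks down aggressively at idle, temporarily switching to a fixed-frequency governor is one way to test whether the idle power states are involved.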
 

Linguofreak

dbeachy1 said: "In my experience, a hard lock-up at idle is almost always caused by a hardware problem -- every time I have had that happen over the years (except when it was caused by overclocking) it always ended up being a hardware issue. :(

I assume you aren't overclocking, correct? If you are, though, the first thing I would do is set the clock back to stock speed. If you are NOT overclocking, as a test you could also try underclocking your CPU by 200 MHz or so to see if that resolves the issue."

No, I'm not overclocking, and never have. The CPU governors pull all the way back to 800 MHz (on a 3 GHz chip) at idle, when the failures seem to mostly be occurring, so I doubt it's a problem with stability at max clock rate.


Xyon said: "I don't think the Windows log is going to help here."

Yeah, even aside from it not being a Windows system, there's nothing obvious in the logs.

Xyon said: "More helpfully, I have experienced similar pain, due to my disk being set into AHCI mode and the operating system not liking the driver for that - but those freezes also happened while using it. I would, since you're running Linux, look into what CPU governor and I/O scheduler you're running, and try to see in dmesg or the syslog if the CPU or disk is being sent to sleep because it has no work to do in a bid to save power."

If it were a disk mode issue, I'd think the problem would have developed earlier. Funny thing is, though, the first freeze I can recall coincided (I think) with the failure of a USB backup disk. That almost makes me wonder if a power issue on the computer put a stray voltage on the USB cable that fried the disk, or if the disk (separately powered) dumped something onto the cable that damaged the mobo or power supply. The other alternative is that my UPS (or its predecessor, which failed loudly) gave both nasty power that screwed them up. In any of those cases, though, I'd expect other equipment on that desk to have failed as well, and I haven't noticed any such failures.

All of the freezes have occurred since I started using a specific custom kernel, so I'm hoping reverting to stock will help.
 

dbeachy1

One thing to remember about modern CPUs and their auto-clock speed adjustments: the CPU also lowers Vcore (i.e., core voltage) when throttling down its clock speed, so it is quite possible for a marginal CPU to be just as likely to lock up / crash at idle clock speeds as at full speed. In fact, I had that happen more than once with CPUs that I was overclocking: they would be fine in games, but then lock up when I was just idling or browsing the Internet, because even though the clock speed was lower, the Vcore was lower as well. The solution in my case was a slight Vcore bump and/or a reduction in the overclock speed.

Although this happened to me during overclocking testing, the same thing happens at stock speeds in that the CPU reduces its core voltage as it lowers its clock speed. So if your lockups turn out to not be caused by your Linux kernel, you may want to try downclocking your CPU just as a test.
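
One way to look for sagging voltages before a lockup is to log the motherboard's voltage sensors at a regular interval and sync each line to disk, so the last entry written before a freeze survives the reset. This is only a sketch under assumptions: it reads the standard Linux hwmon sysfs files (`in*_input`, reported in millivolts), but which channel corresponds to Vcore depends on the sensor chip and board, and some boards expose no voltage sensors at all. The function names, log path, and interval are hypothetical.

```python
import glob
import os
import time


def read_voltages(hwmon_root="/sys/class/hwmon"):
    """Return {'hwmonN/inM': volts} for every readable hwmon voltage input."""
    readings = {}
    pattern = os.path.join(hwmon_root, "hwmon*", "in*_input")
    for path in sorted(glob.glob(pattern)):
        chip = os.path.basename(os.path.dirname(path))       # e.g. 'hwmon0'
        channel = os.path.basename(path)[: -len("_input")]   # e.g. 'in0'
        try:
            with open(path) as f:
                millivolts = int(f.read().strip())
        except (OSError, ValueError):
            continue  # sensor unreadable or not a number; skip it
        readings[f"{chip}/{channel}"] = millivolts / 1000.0
    return readings


def append_heartbeat(log_path, readings, now=None):
    """Append one timestamped line of readings and force it to disk."""
    stamp = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(now))
    fields = " ".join(f"{name}={volts:.3f}V" for name, volts in sorted(readings.items()))
    with open(log_path, "a") as f:
        f.write(f"{stamp} {fields}\n")
        f.flush()
        os.fsync(f.fileno())  # make sure the line survives a hard reset


# Example use (log path and 10-second interval are arbitrary choices):
# while True:
#     append_heartbeat("/var/tmp/heartbeat.log", read_voltages())
#     time.sleep(10)
```

After the next freeze, the final line of the log gives both the approximate time of the lockup and the last voltages seen, which can be compared against readings taken under load.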
 

Linguofreak

dbeachy1 said: "One thing to remember about modern CPUs and their auto-clock speed adjustments: the CPU also lowers Vcore (i.e., core voltage) when throttling down its clock speed, so it is quite possible for a marginal CPU to be just as likely to lock up / crash at idle clock speeds as at full speed. In fact, I had that happen more than once with CPUs that I was overclocking: they would be fine in games, but then lock up when I was just idling or browsing the Internet, because even though the clock speed was lower, the Vcore was lower as well. The solution in my case was a slight Vcore bump and/or a reduction in the overclock speed.

Although this happened to me during overclocking testing, the same thing happens at stock speeds in that the CPU reduces its core voltage as it lowers its clock speed. So if your lockups turn out to not be caused by your Linux kernel, you may want to try downclocking your CPU just as a test."

The power supply will probably be the next thing I try if I can rule out the kernel, as it seems to be the most consistent thing in the various responses I've received. I know that there has been at least one event with potential to damage the power supply, the loud UPS failure I mentioned in my last post (though the lockups started a while after that incident). UPSes are supposed to protect your equipment from power transients, but with the flash and bang my last one made, I don't trust it not to have caused one.
 

Artlav

Linguofreak said: "UPSes are supposed to protect your equipment from power transients, but with the flash and bang my last one made, I don't trust it not to have caused one."

Assuming it's a good UPS, the flash and bang inside of it means a flash and bang not happening inside the PC - there are many last-ditch protection devices which work with a bang, e.g. the crowbar circuit.
Assuming...
 

steph

Well-known member
Joined
Mar 22, 2008
Messages
1,394
Reaction score
715
Points
113
Location
Vendee, France
I might be talking crap here, since I'm by no means an expert, but have you checked the voltage? A friend of mine had to install a voltage regulator, because apparently it didn't stay at 220V all the time, and this was wreaking havoc on the electronics.
 