[arch-general] Troubleshooting random crash
Hello all,

I'm having an issue on my work machine where for the past week, each day when I come in the computer freezes. I'm using KDE 4.9.5 along with 3.7.5-1-ARCH. I believe this started happening after a recent update but I can't know for sure and I can't really reproduce it...it's happening at seemingly random times, however the pattern I see developing is that after my PC has been running all day, I'll come in the next morning, open Thunderbird, and the system just freezes. I can't ctrl+alt+F[1-12] or do anything. Using another system, I'm able to telnet to port 22 on the "frozen" box (I run ssh on this box) but cannot get connected via ssh.

My question is, how do I go about troubleshooting this issue? I know there are so many factors involved that it can really be anything, but I'd think if it were an X11/KDE/Thunderbird issue, I'd at least be able to switch to a console and/or SSH to the box. I also don't know if it's in fact a kernel crash because I can telnet to ports on the box.

I AM running root BtrFS, which I think may be the culprit. I know, I know, not production ready, etc., but I'm not actually using this box as a production server, just as my main work desktop so I figured I'd be fine. Also, again, I didn't start having issues until maybe 2 weeks ago -- and I've had this box running since September '12. Luckily I haven't had any filesystem corruption AFAIK, but I'm thinking it's only a matter of time.

Any tips on things I could set up to try to capture some sort of output or perhaps a kernel dump (if it's the kernel crashing)? I'm just at a loss here, and I'm not accustomed to needing to reboot my Linux PC so often (though I have to troubleshoot a certain other OS daily, and that's usually the resolution :p ).

-- Andre Goree andre@drenet.info
There are obvious gaps in your report; fixing them would be a good first step towards better understanding the problem. For instance:
[2013-02-06 10:57:59 -0500] Andre Goree:
I believe this started happening after a recent update but I can't know for sure and I can't really reproduce it...
Give a window for when you started noticing the symptoms. See in /var/log/pacman.log what packages were upgraded then. Downgrade them and see if the issue persists.
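One way to narrow that window down is to filter the log by date. The sketch below is an illustration, not something posted to the list: `upgrades_between` is a made-up helper name, and it assumes every log line begins with a bracketed timestamp whose first ten characters are the date (for example a hypothetical line such as `[2013-01-21 09:14] upgraded foo (1.0-1 -> 1.1-1)`).

```shell
# Print the "upgraded" entries of a pacman log that fall between two dates
# (inclusive). Dates are given as YYYY-MM-DD and compared as plain strings,
# which works because that format sorts chronologically.
# Usage: upgrades_between LOGFILE START END
upgrades_between() {
    awk -v start="$2" -v end="$3" '
        / upgraded / {
            day = substr($1, 2, 10)   # "[2013-01-21" -> "2013-01-21"
            if (day >= start && day <= end) print
        }
    ' "$1"
}
```

For the period discussed in this thread, the call would be something like `upgrades_between /var/log/pacman.log 2013-01-21 2013-02-06`.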
Using another system, I'm able to telnet to port 22 on the "frozen" box (I run ssh on this box) but cannot get connected via ssh.
What does "able to telnet to port 22" mean? Do you get the SSH banner? If yes, at what point does the SSH connection hang or get interrupted (ssh -vvv)? What do the SSHD logs show on the server side?
I'm not actually using this box as a production server, just as my main work desktop
So you produce nothing at work?
Any tips on things I could set up to try to capture some sort of output or perhaps a kernel dump (if it's the kernel crashing)?
How about looking at the system logs to see what your system was up to just before a crash? -- Gaetan
On 02/06/13 16:29, Gaetan Bisson wrote:
There are obvious gaps in your report; fixing them would be a good first step towards better understanding the problem. For instance:
[2013-02-06 10:57:59 -0500] Andre Goree:
I believe this started happening after a recent update but I can't know for sure and I can't really reproduce it...
Give a window for when you started noticing the symptoms.
See in /var/log/pacman.log what packages were upgraded then.
Downgrade them and see if the issue persists.
As I said in the original mail: "Also, again, I didn't start having issues until maybe 2 weeks ago"
Here is my pacman.log file from that time forward: http://www.drenet.net/paclog.txt
Not really too keen on downgrading a bunch of packages that might break dependencies and provide a REAL mess. If I have to go through that long process, I'd rather just reinstall -- which at this point I'm planning to do anyways.
Using another system, I'm able to telnet to port 22 on the "frozen" box (I run ssh on this box) but cannot get connected via ssh.
What does "able to telnet to port 22" mean? Do you get the SSH banner?
If yes, at what point does the SSH connection hang or get interrupted (ssh -vvv)?
What do the SSHD logs show on the server side?
That means, from another box on the network (my laptop in this instance), I'm able to telnet to the hung/frozen desktop. Yes, I got the SSH banner. I tried 'ssh -v' when this happened earlier today, and it hung after "Connecting to sideswipe-DT". Next time I shall try -vvv. Nothing is produced in the SSH logs on the desktop. In fact it seems all system processes hang because no logs are produced after the issue rears its ugly head.
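To pin down where the client stalls, it can help to save the verbose output and look at the last debug line it wrote. A rough sketch, assuming an OpenSSH client and the coreutils `timeout` command; the function names and the 30-second limit are arbitrary choices made for illustration:

```shell
# Run one verbose, non-interactive SSH attempt and keep the debug output.
# BatchMode=yes avoids waiting on a password prompt, ConnectTimeout bounds
# the TCP connection phase, and the outer `timeout` bounds the whole attempt
# so a hung session cannot block the shell forever.
# Usage: ssh_probe HOST LOGFILE
ssh_probe() {
    host="$1"
    log="$2"
    timeout 30 ssh -vvv -o BatchMode=yes -o ConnectTimeout=10 "$host" true 2> "$log"
}

# Print the last debug line in a saved log, i.e. the last stage the client
# reported before it stopped making progress.
# Usage: last_ssh_stage LOGFILE
last_ssh_stage() {
    grep '^debug[123]:' "$1" | tail -n 1
}
```

Run from the laptop as, say, `ssh_probe sideswipe-DT /tmp/ssh-debug.log` followed by `last_ssh_stage /tmp/ssh-debug.log`. Since the banner is written by an sshd process, seeing it suggests the daemon could still accept connections; a stall after that point would be consistent with, but does not prove, the session blocking on disk access.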
I'm not actually using this box as a production server, just as my main work desktop
So you produce nothing at work?
Not sure if you're just being an ass or not, however if you aren't: that has nothing at all to do with the issue and I merely wanted to establish _why_ I was using btrfs on a machine that I have running at my job -- which is _also_ inconsequential in the context of my email. If you indeed were being an ass, congrats, you succeeded.
Any tips on things I could set up to try to capture some sort of output or perhaps a kernel dump (if it's the kernel crashing)?
How about looking at the system logs to see what your system was up to just before a crash?
I've done that, with no real hints. That's the first thing any linux admin does when confronted with an issue such as this, no?

Is there perhaps a way to build Thunderbird with debug symbols or some kind of logging? I seem to recall opening Thunderbird each time this issue has showed up.

I love Arch for what it is and I actually run it on the aforementioned laptop (an Asus Zenbook) that I used to telnet. It's a great OS and if you know what you're doing you can minimize the hazards of running it on a production machine. It's been for the most part rock-solid in my experience, which is why I'm perplexed by this current issue.

I'm ready to blame btrfs b/c that's the only issue I see with my setup -- I also have a tough time running a virtual machine on this box which I believe is also due to btrfs.

Anyways, thanks for what help and guidance you did provide, it's appreciated.

-- Andre Goree andre@drenet.info
[2013-02-06 19:06:45 -0500] Andre Goree:
Not really too keen on downgrading a bunch of packages that might break dependencies and provide a REAL mess. If I have to go through that long process, I'd rather just reinstall -- which at this point I'm planning to do anyways.
Well, there is little point in posting to this list if you have no motivation to actually investigate the problem. For starters, you've upgraded Linux from 3.6.11 to 3.7.4 in the window when you report the issue appeared; from the symptoms you described, it's a likely suspect. Downgrading it is far from being a "REAL mess": you only need to downgrade/rebuild the external modules you really need (probably none).
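For the kernel-only downgrade, pacman normally keeps previously installed packages in its cache, so the old kernel may still be on disk. A minimal sketch, assuming the default cache directory `/var/cache/pacman/pkg`; `cached_versions` is a hypothetical helper, not a pacman command:

```shell
# List the cached package files for exactly one package name.
# The "-[0-9]" after the name keeps e.g. "linux-headers-..." out of the
# results when asking for "linux"; detached signature files are skipped.
# Usage: cached_versions PKGNAME [CACHEDIR]
cached_versions() {
    dir="${2:-/var/cache/pacman/pkg}"
    for f in "$dir"/"$1"-[0-9]*.pkg.tar.*; do
        case "$f" in
            *.sig) continue ;;
        esac
        if [ -e "$f" ]; then
            printf '%s\n' "$f"
        fi
    done
}
```

If `cached_versions linux` shows a 3.6.11 package, it can be installed with `pacman -U` followed by the path to that file. Adding `IgnorePkg = linux` to `/etc/pacman.conf` should then keep the next full upgrade from replacing it while you test.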
In fact it seems all system processes hang because no logs are produced after the issue rears its ugly head.
Ah. So that would mean your issue is I/O related, then?
So you produce nothing at work?
Not sure if you're just being an ass or not, however if you aren't: that has nothing at all to do with the issue and I merely wanted to establish _why_ I was using btrfs on a machine that I have running at my job -- which is _also_ inconsequential in the context of my email. If you indeed were being an ass, congrats, you succeeded.
Once you were done being offended, you could have looked for the meaning behind the words I used: that your "main work desktop" really qualifies as "a production server". But, of course, as you have so unequivocally declared, btrfs has absolutely "nothing at all to do with the issue". And your statement above implying that the problem is I/O related is just a coincidence. Reporting issues is worthless when speculation is substituted for hard data. For example, a good report would have gone: "I believe this issue is unrelated to btrfs being my root filesystem since, on another Arch machine running ext3, I observe the following identical symptoms: first, `ssh -vvv` hangs at exactly the same point; second..."
How about looking at the system logs to see what your system was up to just before a crash?
I've done that, with no real hints. That's the first thing any linux admin does when confronted with an issue such as this, no?
Sure. But your first post gave no indication that you did that.
Is there perhaps a way to build Thunderbird with debug symbols or some kind of logging? I seem to recall opening Thunderbird each time this issue has showed up.
Well it would be nice to confirm that it is indeed at fault; downgrading it is certainly not a "REAL mess" either. You can certainly also build it with debug symbols: in the PKGBUILD (or makepkg.conf), set CXXFLAGS='' LDFLAGS='' CFLAGS='-g' and remove the strip option.
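As a rough illustration of those settings (not a tested Thunderbird build recipe), the relevant lines might look like the fragment below. The flag values are the ones given above; the comment about `CXXFLAGS` is an assumption added here, not part of the original advice.

```shell
# In /etc/makepkg.conf, or as overrides near the top of the PKGBUILD:
CFLAGS='-g'
CXXFLAGS=''     # assumption: C++ sources take their flags from CXXFLAGS, so
                # '-g' would be needed here too to get symbols for C++ code
LDFLAGS=''

# In the PKGBUILD: a leading "!" turns a makepkg option off, so this keeps
# makepkg from stripping symbols out of the built binaries.
options=('!strip')
```

If the PKGBUILD already defines an `options` array, `'!strip'` would be added to it rather than replacing it; in `makepkg.conf` the equivalent change is turning `strip` into `!strip` inside the `OPTIONS` array.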
I'm ready to blame btrfs b/c that's the only issue I see with my setup -- I also have a tough time running a virtual machine on this box which I believe is also due to btrfs.
Didn't you write just a few lines ago that btrfs "has nothing at all to do with the issue"? Wild guess: your thunderbird mail database is huge (just like the disk image of your virtual machine - although I cannot really know what you mean by "tough time") and your btrfs has problems dealing with such big files (for instance, because your filesystem is nearly full). To confirm, start thunderbird with an empty profile (such as by renaming ~/.mozilla into ~/.mozilla.old) and see what happens. -- Gaetan
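A sketch of the empty-profile test. One assumption to flag: on Linux, Thunderbird normally keeps its profile under `~/.thunderbird` rather than `~/.mozilla`, so that is the default used here; pass another directory if yours differs. `set_aside_profile` is a hypothetical helper name.

```shell
# Move a profile directory out of the way (nothing is deleted) so that the
# application creates a fresh, empty profile the next time it starts.
# Prints the backup location on success.
# Usage: set_aside_profile [PROFILE_DIR]
set_aside_profile() {
    dir="${1:-$HOME/.thunderbird}"
    if [ ! -d "$dir" ]; then
        echo "no profile directory at $dir" >&2
        return 1
    fi
    backup="$dir.old.$(date +%Y%m%d%H%M%S)"
    mv -- "$dir" "$backup" && printf '%s\n' "$backup"
}
```

Thunderbird should be closed before running this. To go back to the old mail store, quit Thunderbird again and move the backup directory to its original name.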
On 02/06/13 20:14, Gaetan Bisson wrote:
[2013-02-06 19:06:45 -0500] Andre Goree:
Not really too keen on downgrading a bunch of packages that might break dependencies and provide a REAL mess. If I have to go through that long process, I'd rather just reinstall -- which at this point I'm planning to do anyways.
Well, there is little point in posting to this list if you have no motivation to actually investigate the problem.
For starters, you've upgraded Linux from 3.6.11 to 3.7.4 in the window when you report the issue appeared; from the symptoms you described, it's a likely suspect. Downgrading it is far from being a "REAL mess": you only need to downgrade/rebuild the external modules you really need (probably none).
Indeed there isn't, and surely even less point in replying to said post if in fact I had no motivation. Given that I'm replying, I'd probably like to avoid reinstalling if at all possible. I like the idea of downgrading just the kernel -- obviously I meant that downgrading every package I've upgraded since 1/21 was not something I wanted to undertake. I'll try this tomorrow.
In fact it seems all system processes hang because no logs are produced after the issue rears its ugly head.
Ah. So that would mean your issue is I/O related, then?
It would seem so, yes. I hinted at this at the end of my last reply as well.
So you produce nothing at work?
Not sure if you're just being an ass or not, however if you aren't: that has nothing at all to do with the issue and I merely wanted to establish _why_ I was using btrfs on a machine that I have running at my job -- which is _also_ inconsequential in the context of my email. If you indeed were being an ass, congrats, you succeeded.
Once you were done being offended, you could have looked for the meaning behind the words I used: that your "main work desktop" really qualifies as "a production server".
But, of course, as you have so unequivocally declared, btrfs has absolutely "nothing at all to do with the issue". And your statement above implying that the problem is I/O related is just a coincidence.
I think you mis-comprehended my reply. Following the context, I merely meant that distinguishing my system from a production server and explaining why I was running btrfs on this system was inconsequential to the issue at hand. Which is still true. I never said nor meant it to be understood that I believed btrfs not to be the problem. In fact, the opposite is true. So, for the sake of clarity, I never declared (and certainly not unequivocally) "btrfs has absolutely nothing at all to do with the issue", but rather, my distinctions and reasons for running btrfs have nothing to do with the issue. Not sure how you got that mixed up, especially given the later part of my reply.
Reporting issues is worthless when speculation is substituted for hard data. For example, a good report would have gone: "I believe this issue is unrelated to btrfs being my root filesystem since, on another Arch machine running ext3, I observe the following identical symptoms: first, `ssh -vvv` hangs at exactly the same point; second..."
I'll be sure to raise my reporting standards the next time I'd like help from an Arch list, my apologies.
How about looking at the system logs to see what your system was up to just before a crash?
I've done that, with no real hints. That's the first thing any linux admin does when confronted with an issue such as this, no?
Sure. But your first post gave no indication that you did that.
Indeed, I need to raise my reporting standards, I figured a lot of stuff was implied but I now know I must be much clearer. Again, my apologies.
Is there perhaps a way to build Thunderbird with debug symbols or some kind of logging? I seem to recall opening Thunderbird each time this issue has showed up.
Well it would be nice to confirm that it is indeed at fault; downgrading it is certainly not a "REAL mess" either. You can certainly also build it with debug symbols: in the PKGBUILD (or makepkg.conf), set CXXFLAGS='' LDFLAGS='' CFLAGS='-g' and remove the strip option.
Given that thunderbird wasn't upgraded in the time that this issue began, not sure a downgrade would help but it may be worth a shot. Thanks for the pointers on building with debug symbols.
I'm ready to blame btrfs b/c that's the only issue I see with my setup -- I also have a tough time running a virtual machine on this box which I believe is also due to btrfs.
Didn't you write just a few lines ago that btrfs "has nothing at all to do with the issue"?
I most certainly did not, there's an obvious misunderstanding here.
Wild guess: your thunderbird mail database is huge (just like the disk image of your virtual machine - although I cannot really know what you mean by "tough time") and your btrfs has problems dealing with such big files (for instance, because your filesystem is nearly full). To confirm, start thunderbird with an empty profile (such as by renaming ~/.mozilla into ~/.mozilla.old) and see what happens.
The thing is, this doesn't happen every time I start thunderbird -- rather, seemingly, after the system has been up for a long period (>20 hrs or so). The filesystem is not nearly full either, though it does contain a lot of data. I'm thinking downgrading to 3.6.x will help a bit. I'm going to look for btrfs bugs in 3.7.x and see if anyone else has been having a similar issue as well. Thanks for the assistance. -- Andre Goree andre@drenet.info
Andre Goree <andre@drenet.info> wrote:
The thing is, this doesn't happen every time I start thunderbird -- rather, seemingly, after the system has been up for a long period (>20 hrs or so). The filesystem is not nearly full either, though it does contain a lot of data. I'm thinking downgrading to 3.6.x will help a bit. I'm going to look for btrfs bugs in 3.7.x and see if anyone else has been having a similar issue as well. Thanks for the assistance.
For downgrading, I have found the Arch Rollback Machine quite handy. You can choose a date to roll back to, sync and then have pacman reinstall all packages that it finds are newer than its database. This is of course if there was not a significant change like potentially the recent filesystem update. I would certainly try doing just the kernel first though, as that is even easier. Just thought I would mention that amazing tool we have in our debugging arsenal. Regards, -- Curtis Shimamoto sugar.and.scruffy@gmail.com
On 02/06/13 22:45, Curtis Shimamoto wrote:
For downgrading, I have found the Arch Rollback Machine quite handy. You can choose a date to roll back to, sync and then have pacman reinstall all packages that it finds are newer than its database. This is of course if there was not a significant change like potentially the recent filesystem update.
I would certainly try doing just the kernel first though, as that is even easier. Just thought I would mention that amazing tool we have in our debugging arsenal.
Regards,
Awesome, thanks for the suggestion! -- Andre Goree andre@drenet.info
On Thu, Feb 7, 2013 at 1:06 AM, Andre Goree <andre@drenet.info> wrote:
That means, from another box on the network (my laptop in this instance), I'm able to telnet to the hung/frozen desktop. Yes, I got the SSH banner. I tried 'ssh -v' when this happened earlier today, and it hung after "Connecting to sideswipe-DT". Next time I shall try -vvv. Nothing is produced in the SSH logs on the desktop. In fact it seems all system processes hang because no logs are produced after the issue rears its ugly head.
Maybe your disk is nearly full, and we all know what happens when the root file system is full. Maybe it is due to a temporary file growing without limit, so that a reboot deletes it and avoids the problem for a few hours. Note that btrfs has a funny concept [1] of free space. And more so if you use snapshots... I think it is worth looking into it, just in case.
Best regards.
-- Rodrigo
[1]: https://btrfs.wiki.kernel.org/index.php/FAQ#Why_are_there_so_many_ways_to_ch...
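To compare the two views of free space, `df -h /` gives the generic figure while `btrfs filesystem df /` breaks usage down by chunk type. The helper below is only a sketch: `btrfs_chunk_usage` is a made-up name, and it assumes output lines of the form `Data: total=8.00GB, used=6.00GB` (optionally with a profile such as `Metadata, DUP:` and with `GiB`-style units), which may differ between btrfs-progs versions.

```shell
# Read `btrfs filesystem df` output on stdin and print, for each chunk type,
# how much of the space allocated to it is in use.
# Usage: btrfs filesystem df / | btrfs_chunk_usage
btrfs_chunk_usage() {
    awk '
        # Convert a size such as "512.00MiB" or "8.00GB" to bytes,
        # treating every prefix as a power of 1024.
        function bytes(s,   n, u) {
            n = s + 0                      # leading numeric part
            u = s
            gsub(/[0-9.]/, "", u)          # what remains is the unit
            u = substr(u, 1, 1)
            if (u == "K") return n * 1024
            if (u == "M") return n * 1024 ^ 2
            if (u == "G") return n * 1024 ^ 3
            if (u == "T") return n * 1024 ^ 4
            return n
        }
        match($0, /total=[^,]+/) {
            total = substr($0, RSTART + 6, RLENGTH - 6)
            if (match($0, /used=[^, ]+/)) {
                used = substr($0, RSTART + 5, RLENGTH - 5)
                label = $0
                sub(/:.*/, "", label)      # keep the text before the colon
                t = bytes(total)
                if (t > 0) printf "%s: %.0f%%\n", label, 100 * bytes(used) / t
            }
        }
    '
}
```

A high percentage only means the chunks already allocated to that type are nearly full; it becomes a concern when the device has no unallocated space left to hand out, which `btrfs filesystem show` reports per device.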