[arch-general] Help diagnosing kworker 'bug'

newer
[arch-general] bugs.archlinux.org...

Oon-Ee Ng

5 Aug 2014 5 Aug '14

11:37 p.m.

Yesterday night I noticed (just before performing an update my conky showing high continuous writing to root. iotop -Pa shows this:- Total DISK READ : 0.00 B/s | Total DISK WRITE : 8.93 M/s Actual DISK READ: 0.00 B/s | Actual DISK WRITE: 11.06 M/s PID PRIO USER DISK READ DISK WRITE SWAPIN IO> COMMAND 112 be/4 root 0.00 B 64.91 M 0.00 % 24.34 % [kworker/u16:3] 11936 be/4 root 0.00 B 0.00 B 0.00 % 0.06 % [kworker/1:1] 28794 be/4 root 0.00 B 36.00 K 0.00 % 0.00 % [kworker/u16:1] This is after roughly 6-7 seconds, and 65 MB has already been written by that kworker thread. As I said, this was already happening before an upgrade. I ran the upgrade anyway, which upgraded linux to 3.16-2, and still got the same thing. Yes, I'm using [testing]. Any ideas on how to proceed? Next thing I'm going to try is downgrading linux to 3.15, but I thought I'd post this here first in case I don't make it back.

Show replies by date

Oon-Ee Ng

5 Aug 5 Aug

11:57 p.m.

On Wed, Aug 6, 2014 at 7:37 AM, Oon-Ee Ng <ngoonee.talk@gmail.com> wrote:

...

Any ideas on how to proceed? Next thing I'm going to try is downgrading linux to 3.15, but I thought I'd post this here first in case I don't make it back.

And I made it back alive. 3.15.8-1 does not have the same problem. Also downgraded bbswitch to 0.8-14, nvidia t 340.24-3, linux-headers to 3.15.8-1, virtualbox-host-modules to 4.3.14-3, vhba-module to 20140629-5, and lirc-utils to 1:0.9.0-75. Should I open a bug report? I don't think I have sufficient detail, in particular because I don't know how to track down what caused that kworker process. My system is an i7 laptop, the hard disc partitions (including root) are btrfs formatted. Previous Arch Linux bbs posts regarding kworker I/O activity is not as serious as what I'm seeing and seemed to affect ext4. Again, suggestions are appreciated.

Anatol Pomozov

6 Aug 6 Aug

2:40 a.m.

Hi On Tue, Aug 5, 2014 at 4:37 PM, Oon-Ee Ng <ngoonee.talk@gmail.com> wrote:

...

Yesterday night I noticed (just before performing an update my conky showing high continuous writing to root. iotop -Pa shows this:-

Total DISK READ : 0.00 B/s | Total DISK WRITE : 8.93 M/s Actual DISK READ: 0.00 B/s | Actual DISK WRITE: 11.06 M/s PID PRIO USER DISK READ DISK WRITE SWAPIN IO> COMMAND 112 be/4 root 0.00 B 64.91 M 0.00 % 24.34 % [kworker/u16:3] 11936 be/4 root 0.00 B 0.00 B 0.00 % 0.06 % [kworker/1:1] 28794 be/4 root 0.00 B 36.00 K 0.00 % 0.00 % [kworker/u16:1]

This is after roughly 6-7 seconds, and 65 MB has already been written by that kworker thread.

How long this IO activity takes? Could it be some kind of automatic defragmentation or some other internal btrfs background optimization? A good idea is to check btrfs changelog for 3.16 kernel release.

...

As I said, this was already happening before an upgrade. I ran the upgrade anyway, which upgraded linux to 3.16-2, and still got the same thing. Yes, I'm using [testing].

Any ideas on how to proceed? Next thing I'm going to try is downgrading linux to 3.15, but I thought I'd post this here first in case I don't make it back.

'perf' is a great and very powerful tool that allow to debug problems like this. Run '# perf top -g -p $PID' and it will show where the process spends *cpu cycles*. It should be enough to understand what kworker thread does. For all curious minds I highly recommend to read this tutorial https://perf.wiki.kernel.org/index.php/Tutorial

Oon-Ee Ng

7:43 a.m.

On Wed, Aug 6, 2014 at 10:40 AM, Anatol Pomozov <anatol.pomozov@gmail.com> wrote:

...

Hi

On Tue, Aug 5, 2014 at 4:37 PM, Oon-Ee Ng <ngoonee.talk@gmail.com> wrote:

...
This is after roughly 6-7 seconds, and 65 MB has already been written by that kworker thread.

How long this IO activity takes? Could it be some kind of automatic defragmentation or some other internal btrfs background optimization? A good idea is to check btrfs changelog for 3.16 kernel release.

It's continuous and ran for at least 2 hours at most.

...

'perf' is a great and very powerful tool that allow to debug problems like this. Run '# perf top -g -p $PID' and it will show where the process spends *cpu cycles*. It should be enough to understand what kworker thread does. For all curious minds I highly recommend to read this tutorial https://perf.wiki.kernel.org/index.php/Tutorial

Thanks, if my boy gets to sleep early tonight I'll do that.

Oon-Ee Ng

9 Aug 9 Aug

1:58 a.m.

On Wed, Aug 6, 2014 at 3:43 PM, Oon-Ee Ng <ngoonee.talk@gmail.com> wrote:

...

On Wed, Aug 6, 2014 at 10:40 AM, Anatol Pomozov <anatol.pomozov@gmail.com> wrote:

...
'perf' is a great and very powerful tool that allow to debug problems like this. Run '# perf top -g -p $PID' and it will show where the process spends *cpu cycles*. It should be enough to understand what kworker thread does. For all curious minds I highly recommend to read this tutorial https://perf.wiki.kernel.org/index.php/Tutorial

Thanks, if my boy gets to sleep early tonight I'll do that.

Having tried that out, I don't really understand the output. It seems the first column is CPU usage and the second is...? IO? Anyway these are the top 3 things in my output after a short amount of time. Other things which are low in CPU usage and high in the second column are find_next_zero_bit and _raw_spin_lock. Not sure what I should glean from this. + 17.74% 0.10% [kernel] [k] __filemap_fdatawrite_range + 15.04% 0.02% [kernel] [k] filemap_fdatawrite_range + 9.93% 9.93% [kernel] [k] find_next_bit

Anatol Pomozov

3:18 a.m.

Hi On Fri, Aug 8, 2014 at 6:58 PM, Oon-Ee Ng <ngoonee.talk@gmail.com> wrote:

...

On Wed, Aug 6, 2014 at 3:43 PM, Oon-Ee Ng <ngoonee.talk@gmail.com> wrote:

...
On Wed, Aug 6, 2014 at 10:40 AM, Anatol Pomozov <anatol.pomozov@gmail.com> wrote:

...
'perf' is a great and very powerful tool that allow to debug problems like this. Run '# perf top -g -p $PID' and it will show where the process spends *cpu cycles*. It should be enough to understand what kworker thread does. For all curious minds I highly recommend to read this tutorial https://perf.wiki.kernel.org/index.php/Tutorial

Thanks, if my boy gets to sleep early tonight I'll do that.

Having tried that out, I don't really understand the output. It seems the first column is CPU usage and the second is...? IO?

Anyway these are the top 3 things in my output after a short amount of time. Other things which are low in CPU usage and high in the second column are find_next_zero_bit and _raw_spin_lock. Not sure what I should glean from this. + 17.74% 0.10% [kernel] [k] __filemap_fdatawrite_range + 15.04% 0.02% [kernel] [k] filemap_fdatawrite_range + 9.93% 9.93% [kernel] [k] find_next_bit

The first column is CPU usage, not sure about the second column. Click on [+] symbol and it will show you the full call graph for this function. So it will let you understand what subsystem calls it. If it is btrfs then please contact upstream http://vger.kernel.org/vger-lists.html#linux-btrfs Another way to get more information about this problem is to use kernel traces. Let's enable block and writeback events: sudo su cd /sys/kernel/debug/tracing echo 1 > events/writeback/enable echo 1 > events/block/enable echo 1 > tracing_on cat trace_pipe there will some information like processid, inode... Maybe you'll see some pattern in the writes etc.. In any case this problem sounds like an upstream issue and it is better to contact them.

Oon-Ee Ng

24 Aug 24 Aug

9:59 a.m.

On Sat, Aug 9, 2014 at 11:18 AM, Anatol Pomozov <anatol.pomozov@gmail.com> wrote:

...

Hi

Another way to get more information about this problem is to use kernel traces. Let's enable block and writeback events:

sudo su cd /sys/kernel/debug/tracing echo 1 > events/writeback/enable echo 1 > events/block/enable echo 1 > tracing_on cat trace_pipe

there will some information like processid, inode... Maybe you'll see some pattern in the writes etc..

In any case this problem sounds like an upstream issue and it is better to contact them.

I've emailed the btrfs list and will follow up there. I could not make head nor tail of block/writeback events, seemed similar with the problematic kernel versions and the non-problematic ones. It could be that the kernel upgrade merely exposed an underlying issue. In any case, I uploaded a print screen showing the iotop and perf top readings.... hopefully someone can understand it. https://db.tt/TSh5piq6

3952

Age (days ago)

3971

Last active (days ago)

List overview

Download

6 comments

2 participants

participants (2)

Anatol Pomozov
Oon-Ee Ng