
Thread: High I/O Wait after every reboot


  1. #1
    Join Date
    Oct 2008
    Beans
    15

    High I/O Wait after every reboot

    Hello, fellow Ubuntu users!

    First and foremost, let me assure you that I've searched high and low for a solution to my problem. I don't simply jump in here to ask for help if I can avoid it.

    The Machine
    Core 2 Duo E4600
    2GB DDR2 RAM (1 stick)
    Intel ICH10R based motherboard (tried an ICH9R as well)
    4-port SATA controller (PCI Sil 3114)
    O/S: Ubuntu Desktop x64 10.04 LTS (using 'desktop' because I like having a remote desktop)

    The Storage Setup
    Disks: an assorted selection of nine 750 GB, 1000 GB and 1500 GB Seagate and Western Digital disks.

    The disks are joined through a standard LVM2 configuration. I don't know the LVM term, but normally you'd call it a JBOD setup.

    On that LVM device, I've put a cryptsetup device, made with the LUKS tools (aes-xts-plain, 256-bit key).

    On the cryptsetup device, I've created and mounted an EXT4 partition.

    All in all, a completely standard LVM2 and LUKS setup running EXT4.

    The Problem
    After a reboot, I proceed to unlock my cryptsetup encryption device, and then mount the EXT4 partition. All is well, the mount is accessible and everything looks fine.

    I then try to send a file to the mount via Samba. After a few hundred MB written, the I/O wait goes berserk: it sits at 50% (a dual-core setup, remember, so one core is fully blocked on I/O).

    The system becomes unresponsive to network commands (can't browse Samba) for about 5-10 minutes. When it finally responds, the I/O wait is gone and everything is fine: I can write and read hundreds of GBs of data without any issues at all, I can benchmark and stress all disks perfectly well, and no logs show disk errors.

    I tried monitoring my disks with 'iostat -d 2' while the I/O wait was happening, and there is some slight Blk_read/s activity on one disk at a time. First, for example, /dev/sda shows a little Blk_read/s activity, then it jumps to the next disk, and when every disk has shown that slight Blk_read/s activity (500-800 or so), the problem is gone and the I/O wait is no more.
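    In case it helps anyone reproduce what I'm seeing without extra tools: the same per-disk crawl can be watched straight from /proc/diskstats. A minimal sketch (the sd/vd device-name pattern and the 2-second window are assumptions for a box like mine):

```shell
#!/bin/sh
# Print the change in sectors read per disk over a 2 s window, straight
# from /proc/diskstats (field 3 = device name, field 6 = sectors read).
snap() { awk '$3 ~ /^(sd|vd)[a-z]$/ { print $3, $6 }' /proc/diskstats; }
before=$(snap)
sleep 2
snap | while read dev reads; do
  prev=$(printf '%s\n' "$before" | awk -v d="$dev" '$1 == d { print $2 }')
  if [ -n "$prev" ]; then
    echo "$dev: $((reads - prev)) sectors read in 2s"
  fi
done
```

    Run in a loop, this shows the small burst of reads hopping from one disk to the next, just like iostat does.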

    The solution?
    I've tried changing motherboards, switching disks around on the controllers, checking individual disks, replacing disks and I've tried different versions of Ubuntu. The problem however persists.

    I could see it being a network issue, possibly a driver issue. But since the NIC is a standard on-board RTL8111, that seems unlikely: this NIC is literally used everywhere, so a driver bug would show up far more widely. And since I changed my motherboard, a faulty NIC twice in a row also seems unlikely.

    I could also imagine a "self check" feature being run by either cryptsetup/LUKS or LVM, but I can't for the life of me find ANY information about such a feature anywhere.


    I really hope someone in here has experienced a similar problem and has been able to solve it. This problem is the sole reason I'm seriously considering Windows for my file server (yes, I'm going insane).

    Thank you in advance! Any help is appreciated!

    Sincerely
    Martin Moerch Aka. Atroxes

  2. #2
    Join Date
    Oct 2008
    Beans
    15

    Re: High I/O Wait after every reboot

    Shameless bump!

  3. #3
    Join Date
    Oct 2008
    Beans
    15

    Re: High I/O Wait after every reboot

    Trying a bump one last time.

  4. #4
    Join Date
    Jul 2006
    Beans
    90

    Re: High I/O Wait after every reboot

    Just fixed my iowait problem by removing fd0 from /etc/fstab.

    I didn't actually have a floppy drive in the machine, so I'm guessing that was the problem.
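    For anyone else checking their fstab: the entry in question looked something like the stock Ubuntu floppy line below (the exact mount point and options on your system may differ):

```
/dev/fd0  /media/floppy0  auto  rw,user,noauto,exec,utf8  0  0
```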

    It took a lot of investigating to track down, though. The program iotop was not helpful, as a kernel option for per-task I/O accounting was turned off, and as a result it could not show I/O percentages.
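    For anyone hitting the same wall: iotop needs per-task I/O accounting compiled into the kernel (CONFIG_TASKSTATS and CONFIG_TASK_IO_ACCOUNTING). A quick sketch to check, assuming the usual /boot/config-* file exists on your install:

```shell
# Check whether the running kernel has per-task I/O accounting, which
# iotop needs to show per-process I/O percentages. The /boot path is an
# assumption; /proc/config.gz is an alternative on some kernels.
grep -E 'CONFIG_TASK(STATS|_IO_ACCOUNTING)=' "/boot/config-$(uname -r)" 2>/dev/null \
  || echo "config file not found or options unset"
```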

    Watching 'dmesg | tail' in a TTY console eventually tipped me off.

    Try booting up, then switching to a TTY console (Ctrl+Alt+F3) and logging in there. Run 'dmesg | tail', then switch back to X (Ctrl+Alt+F7) and log in.

    If you're watching the system monitor applet, you can switch over to TTY 3 as soon as you see the iowait activity start up. Hopefully then you'll have some info to help you figure out what is going on.

    Good luck!

    PS> I figured if you're working with LVM stuff, you probably know full well how to get into TTY consoles... I put in thorough instructions for any noobs who come along afterward with iowait issues.
    Last edited by CyberCod; May 28th, 2010 at 01:11 PM.

  5. #5
    Join Date
    Oct 2008
    Beans
    15

    Re: High I/O Wait after every reboot

    Thanks for the suggestion!
    Unfortunately, this didn't resolve my problem.

    I removed fd0 from my fstab, and all that's left now is my root partition, /proc and my swap.

    Dmesg doesn't give me any further information. When the I/O wait starts, it looks like this in iostat:

    Code:
    IOstat
    Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
    sda               0,00         0,00         0,00          0          0
    sdc               0,00         0,00         0,00          0          0
    sdf             127,00      1016,00         0,00       1016          0
    sde               0,00         0,00         0,00          0          0
    sdd               0,00         0,00         0,00          0          0
    sdb               0,00         0,00         0,00          0          0
    sdg               2,00         0,00        32,00          0         32
    sdh               0,00         0,00         0,00          0          0
    sdi               0,00         0,00         0,00          0          0
    sdj               0,00         0,00         0,00          0          0
    dm-0            127,00      1016,00         0,00       1016          0
    dm-1            127,00      1016,00         0,00       1016          0
    
    Top
    Cpu(s):  0.0%us,  1.0%sy,  0.0%ni, 50.8%id, 48.2%wa,  0.0%hi,  0.0%si,  0.0%st

    As you can see, iostat reports a very slight read on /dev/sdf in this case. This continues disk by disk until it's gone through them all; after that, the problem is gone. The whole process takes roughly 15-20 minutes, and the LVM partition hangs completely during this time.
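    Side note for anyone eyeballing output like the above: since my locale prints comma decimals, a small tr/awk pipeline can flag which device is doing the reads. A throwaway sketch using three rows from the snapshot pasted above:

```shell
# Pick the device with significant Blk_read/s out of an iostat -d snapshot.
# Comma decimal separators are converted to dots so awk can compare numbers.
sample='sda 0,00 0,00 0,00 0 0
sdf 127,00 1016,00 0,00 1016 0
sdg 2,00 0,00 32,00 0 32'
busy=$(printf '%s\n' "$sample" | tr ',' '.' | awk '$3 > 100 { print $1 }')
echo "device doing the reads: $busy"   # -> device doing the reads: sdf
```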
    Last edited by atroxes; June 2nd, 2010 at 11:53 AM.

  6. #6
    Join Date
    Oct 2008
    Beans
    15

    Re: High I/O Wait after every reboot

    UPDATE

    Just now, a few minutes after the I/O wait started, I'm seeing some really strange behaviour in dmesg:

    Code:
    [  720.390396] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
    [  720.390769] jbd2/dm-1-8   D 00000000ffffffff     0  3823      2 0x00000000
    [  720.390777]  ffff88007ae13d20 0000000000000046 0000000000015bc0 0000000000015bc0
    [  720.390785]  ffff8800378c31a0 ffff88007ae13fd8 0000000000015bc0 ffff8800378c2de0
    [  720.390791]  0000000000015bc0 ffff88007ae13fd8 0000000000015bc0 ffff8800378c31a0
    [  720.390798] Call Trace:
    [  720.390813]  [<ffffffff8121a64f>] jbd2_journal_commit_transaction+0x1bf/0x1270
    [  720.390821]  [<ffffffff81076e2c>] ? lock_timer_base+0x3c/0x70
    [  720.390828]  [<ffffffff81085320>] ? autoremove_wake_function+0x0/0x40
    [  720.390834]  [<ffffffff81221c3d>] kjournald2+0xbd/0x220
    [  720.390840]  [<ffffffff81085320>] ? autoremove_wake_function+0x0/0x40
    [  720.390845]  [<ffffffff81221b80>] ? kjournald2+0x0/0x220
    [  720.390850]  [<ffffffff81084fa6>] kthread+0x96/0xa0
    [  720.390856]  [<ffffffff810141ea>] child_rip+0xa/0x20
    [  720.390862]  [<ffffffff81084f10>] ? kthread+0x0/0xa0
    [  720.390866]  [<ffffffff810141e0>] ? child_rip+0x0/0x20
    [  720.390871] INFO: task flush-252:1:3826 blocked for more than 120 seconds.
    [  720.391182] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
    [  720.391565] flush-252:1   D 00000000ffffffff     0  3826      2 0x00000000
    [  720.391572]  ffff88007b0bda50 0000000000000046 0000000000015bc0 0000000000015bc0
    [  720.391579]  ffff880065e64890 ffff88007b0bdfd8 0000000000015bc0 ffff880065e644d0
    [  720.391586]  0000000000015bc0 ffff88007b0bdfd8 0000000000015bc0 ffff880065e64890
    [  720.391592] Call Trace:
    [  720.391598]  [<ffffffff81218961>] start_this_handle+0x251/0x4b0
    [  720.391604]  [<ffffffff81085320>] ? autoremove_wake_function+0x0/0x40
    [  720.391610]  [<ffffffff81218d61>] ? jbd2_journal_start+0x81/0x100
    [  720.391616]  [<ffffffff81218d95>] jbd2_journal_start+0xb5/0x100
    [  720.391622]  [<ffffffff811d9ff5>] ? ext4_meta_trans_blocks+0x75/0xf0
    [  720.391629]  [<ffffffff811f6d78>] ext4_journal_start_sb+0x58/0x90
    [  720.391635]  [<ffffffff811dfae3>] ext4_da_writepages+0x243/0x610
    [  720.391643]  [<ffffffff810fcf21>] do_writepages+0x21/0x40
    [  720.391648]  [<ffffffff81164cc6>] writeback_single_inode+0xf6/0x3d0
    [  720.391654]  [<ffffffff81165930>] writeback_inodes_wb+0x410/0x5e0
    [  720.391659]  [<ffffffff81165c0a>] wb_writeback+0x10a/0x1d0
    [  720.391665]  [<ffffffff810c91d3>] ? rcu_start_gp+0x53/0x1a0
    [  720.391671]  [<ffffffff810c9d2e>] ? call_rcu+0xe/0x10
    [  720.391676]  [<ffffffff81165edd>] wb_do_writeback+0x12d/0x1a0
    [  720.391681]  [<ffffffff81165fa3>] bdi_writeback_task+0x53/0xe0
    [  720.391687]  [<ffffffff8110e8b6>] bdi_start_fn+0x86/0x100
    [  720.391692]  [<ffffffff8110e830>] ? bdi_start_fn+0x0/0x100
    [  720.391697]  [<ffffffff81084fa6>] kthread+0x96/0xa0
    [  720.391702]  [<ffffffff810141ea>] child_rip+0xa/0x20
    [  720.391707]  [<ffffffff81084f10>] ? kthread+0x0/0xa0
    [  720.391712]  [<ffffffff810141e0>] ? child_rip+0x0/0x20
    [  840.390045] INFO: task jbd2/dm-1-8:3823 blocked for more than 120 seconds.
    [  840.390363] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
    [  840.390736] jbd2/dm-1-8   D 00000000ffffffff     0  3823      2 0x00000000
    [  840.390744]  ffff88007ae13d20 0000000000000046 0000000000015bc0 0000000000015bc0
    [  840.390752]  ffff8800378c31a0 ffff88007ae13fd8 0000000000015bc0 ffff8800378c2de0
    [  840.390759]  0000000000015bc0 ffff88007ae13fd8 0000000000015bc0 ffff8800378c31a0
    [  840.390766] Call Trace:
    [  840.390780]  [<ffffffff8121a64f>] jbd2_journal_commit_transaction+0x1bf/0x1270
    [  840.390788]  [<ffffffff81076e2c>] ? lock_timer_base+0x3c/0x70
    [  840.390795]  [<ffffffff81085320>] ? autoremove_wake_function+0x0/0x40
    [  840.390802]  [<ffffffff81221c3d>] kjournald2+0xbd/0x220
    [  840.390807]  [<ffffffff81085320>] ? autoremove_wake_function+0x0/0x40
    [  840.390813]  [<ffffffff81221b80>] ? kjournald2+0x0/0x220
    [  840.390818]  [<ffffffff81084fa6>] kthread+0x96/0xa0
    [  840.390824]  [<ffffffff810141ea>] child_rip+0xa/0x20
    [  840.390829]  [<ffffffff81084f10>] ? kthread+0x0/0xa0
    [  840.390834]  [<ffffffff810141e0>] ? child_rip+0x0/0x20
    [  840.390838] INFO: task flush-252:1:3826 blocked for more than 120 seconds.
    [  840.391150] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
    [  840.391535] flush-252:1   D 00000000ffffffff     0  3826      2 0x00000000
    [  840.391541]  ffff88007b0bda50 0000000000000046 0000000000015bc0 0000000000015bc0
    [  840.391548]  ffff880065e64890 ffff88007b0bdfd8 0000000000015bc0 ffff880065e644d0
    [  840.391555]  0000000000015bc0 ffff88007b0bdfd8 0000000000015bc0 ffff880065e64890
    [  840.391561] Call Trace:
    [  840.391567]  [<ffffffff81218961>] start_this_handle+0x251/0x4b0
    [  840.391573]  [<ffffffff81085320>] ? autoremove_wake_function+0x0/0x40
    [  840.391579]  [<ffffffff81218d61>] ? jbd2_journal_start+0x81/0x100
    [  840.391585]  [<ffffffff81218d95>] jbd2_journal_start+0xb5/0x100
    [  840.391591]  [<ffffffff811d9ff5>] ? ext4_meta_trans_blocks+0x75/0xf0
    [  840.391598]  [<ffffffff811f6d78>] ext4_journal_start_sb+0x58/0x90
    [  840.391604]  [<ffffffff811dfae3>] ext4_da_writepages+0x243/0x610
    [  840.391612]  [<ffffffff810fcf21>] do_writepages+0x21/0x40
    [  840.391618]  [<ffffffff81164cc6>] writeback_single_inode+0xf6/0x3d0
    [  840.391623]  [<ffffffff81165930>] writeback_inodes_wb+0x410/0x5e0
    [  840.391628]  [<ffffffff81165c0a>] wb_writeback+0x10a/0x1d0
    [  840.391635]  [<ffffffff810c91d3>] ? rcu_start_gp+0x53/0x1a0
    [  840.391640]  [<ffffffff810c9d2e>] ? call_rcu+0xe/0x10
    [  840.391645]  [<ffffffff81165edd>] wb_do_writeback+0x12d/0x1a0
    [  840.391650]  [<ffffffff81165fa3>] bdi_writeback_task+0x53/0xe0
    [  840.391656]  [<ffffffff8110e8b6>] bdi_start_fn+0x86/0x100
    [  840.391661]  [<ffffffff8110e830>] ? bdi_start_fn+0x0/0x100
    [  840.391666]  [<ffffffff81084fa6>] kthread+0x96/0xa0
    [  840.391671]  [<ffffffff810141ea>] child_rip+0xa/0x20
    [  840.391676]  [<ffffffff81084f10>] ? kthread+0x0/0xa0
    [  840.391681]  [<ffffffff810141e0>] ? child_rip+0x0/0x20
    I have no idea how to interpret this information. Something looks awfully wrong.
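    From what I can gather, "blocked for more than 120 seconds" is the kernel's hung-task watchdog flagging threads stuck in uninterruptible sleep (state D), here the jbd2 journal thread and the writeback flusher waiting on the disks. A sketch to list those tasks live, next time it happens:

```shell
# List tasks currently in uninterruptible sleep (state D); during the
# stall, jbd2/dm-1-8 and flush-252:1 from the traces above should appear.
ps -eo state,pid,comm | awk 'NR == 1 || $1 ~ /^D/'
```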
    Last edited by atroxes; June 2nd, 2010 at 12:05 PM.
