Page 1 of 2
Results 1 to 10 of 15

Thread: 12.04.1 on Dell R720?

  1. #1
    Join Date
    Sep 2012
    Beans
    5

    12.04.1 on Dell R720?

    My company has been deploying some telco application software running under Ubuntu on Dell R710s for some time. The R710 has been end-of-lifed, so for the next customer deployment we purchased Dell R720 servers.

    Is anyone successfully using 12.04.1 64-bit on a Dell R720? My configuration has a single E5-2620 processor and 16 GB RAM.

    The R720 does not appear on the certified platforms list, but I assume it will at some point since all the previous generation hardware does. Is there a projected release that will support the R720?

    The installation went fine, but I am seeing a couple of issues. They could simply be flaky hardware, but the Dell diagnostics pass, and since Dell doesn't support Ubuntu on servers, I doubt I'll get any help from them.

    1. For testing, I was connected only to eth0 with 1000BT. After 5-15 minutes the link went down (no link lights) and I was unable to get it back by plugging/unplugging or ifdown/ifup. After a reboot it worked again for a short time. I observed this both using DHCP and a static IP.

    2. The BIOS event log and diagnostic display show 2 errors, "Critical CPU1 Status: Processor sensor for CPU1, IERR was asserted" and "Critical PCIE Fatal Err: Critical Event sensor, bus fatal error [Bus 1 Device 0 Function 0] was asserted". These have only shown up one time even though I've seen problem 1 and rebooted several times.

    Before I pull a second server out of the box and repeat my testing, does anyone have any experience or advice?

    Thanks!

    -Mark

  2. #2
    Join Date
    Nov 2008
    Location
    S.H.I.E.L.D. 6-1-6
    Beans
    Hidden!
    Distro
    Ubuntu Development Release

    Re: 12.04.1 on Dell R720?

    "Critical CPU1 Status: Processor sensor for CPU1, IERR was asserted" indicates that the processor has asserted its IERR (internal error) pin. That points to the second error, which is the real problem.

    The second error refers to a problem in the PCI-E port.

    Have you checked/reseated any cards plugged in PCI-E ports?
    Tried the cards in other computers?
    Don't waste your energy trying to change opinions ... Do your thing, and don't care if they like it.

  3. #3
    Join Date
    Sep 2012
    Beans
    5

    Re: 12.04.1 on Dell R720?

    I think it's more likely a software problem than an actual hardware problem, but that's why I'm asking for feedback from anyone else with an R720. There is only one actual PCI-E card, which is the currently unused second network card. There are also permanently attached devices like the RAID controller and onboard network interfaces, which I suspect are behind the PCI-E controller.

    Thanks,

    -Mark

  4. #4
    Join Date
    Sep 2012
    Beans
    4

    Re: 12.04.1 on Dell R720?

    Hi

    Just so you know, you're not the only one with these issues.
    Last week I had both CPUs reporting a failure in one of my brand new R720s. I have just had it pop up again on another R720 this morning.

    I believe the issue with your network is due to the processor. From what I read on a review site earlier, with the latest architecture the PCI-E lanes are linked to the processor, so if you take out a CPU you lose half your PCI-E lanes.

    So it stands to reason that the fault in the CPU is linked to the PCI-E error.

    Also, I am about to contact Dell again. They have been very helpful: after I ran their CentOS 6 live diagnostics image and uploaded the results, they had a guy here in around 14 hours to fix it.

    I believe the issue with the processors is poor handling of Intel Turbo Boost 2.0 and Dell's power management. Each time I have had a processor error, it has printed CPU power error messages on the screen, which from other forums I have discovered is due to Turbo Boost.
    Rebooting while it is using Turbo Boost seems to be the trigger, though why it was using Turbo Boost with a peak load of 0.4 according to 'top' is a mystery to me.

    Anyway good luck with this and please keep everyone updated if you get any closer to discovering the root problem.

  5. #5
    Join Date
    Sep 2012
    Beans
    4

    Re: 12.04.1 on Dell R720?

    I have just spoken to Dell technical support again. They are telling me that there is currently a problem running Ubuntu on the R720 and that there should be a BIOS update within the next month to resolve the issue.

    They are currently working with me to provide a temporary workaround.

    My understanding of the issue is that Ubuntu is not cleaning up threads correctly, so the processors are reporting being overloaded and throwing the error.

    I will update with results from my testing of the workaround.

  6. #6
    Join Date
    Nov 2008
    Location
    Boston MetroWest
    Beans
    16,326

    Re: 12.04.1 on Dell R720?

    Just curious, but is this true just for Ubuntu, or is it true for all Linux distros? What about the versions of RedHat and Novell that Dell itself distributes? Do they also have this problem? If not, I'd give CentOS a try and see if that works any better.
    If you ask for help, do not abandon your request. Please have the courtesy to check for responses and thank the people who helped you.


  7. #7
    Join Date
    Sep 2012
    Beans
    5

    Re: 12.04.1 on Dell R720?

    Thank you for your replies.

    Dell is coming out today to replace the motherboard, PCI-E riser, and network daughter card on the server that showed the CPU IERR fault. It doesn't sound like that's really going to fix anything.

    The main symptom I see is that the link on eth0 will go down. This seems to be asynchronous to any CPU IERR or PCI-E faults caught by the Lifecycle Controller. We used the machine heavily for three days, and I test transferred 500 GB of data, and it worked fine. Then out of the blue eth0 had no link.

    SeijiSensei, I have no personal experience with CentOS on the R720, but Dell is shipping and supporting the RedHat and SuSE Enterprise versions, so I suspect CentOS (a rebuild of RedHat) should be fine. This also leads me to believe that whatever shortcomings the hardware has have been fixed in software in those distributions. I need a Debian-based release for this application, so I can't use the supported OSes.
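
    To check whether the link drops really are unrelated in time to the logged faults, each link change could be timestamped and compared against the Lifecycle Controller log. A rough Python sketch, assuming the standard sysfs file /sys/class/net/<iface>/carrier is readable on the server. The names link_is_up and watch are hypothetical helpers made up for this example, and the 5-second poll interval is arbitrary:

```python
import time
from pathlib import Path


def link_is_up(iface, sysfs_root="/sys/class/net"):
    """Return True if the kernel reports carrier (link) on the interface."""
    try:
        text = (Path(sysfs_root) / iface / "carrier").read_text()
    except OSError:
        # The read fails if the interface is missing or administratively down.
        return False
    return text.strip() == "1"


def watch(iface, interval=5.0, sysfs_root="/sys/class/net", max_polls=None):
    """Poll the link and print a timestamped line whenever its state changes.

    max_polls=None means poll forever; a number stops after that many polls.
    """
    last = None
    polls = 0
    while max_polls is None or polls < max_polls:
        up = link_is_up(iface, sysfs_root)
        if up != last:
            stamp = time.strftime("%Y-%m-%d %H:%M:%S")
            print(f"{stamp} {iface} link {'up' if up else 'DOWN'}")
            last = up
        polls += 1
        time.sleep(interval)
```

    Calling watch("eth0") and redirecting the output to a file would give a list of drop times to line up against the BIOS event log.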

  8. #8
    Join Date
    Sep 2012
    Beans
    4

    Re: 12.04.1 on Dell R720?

    Well as promised, minor update.

    Dell asked me to change a few settings, and so far they are working.

    The first two things are BIOS settings:
    under server profiles, set a custom profile with power management set to maximum performance and C-States disabled.

    Then, within Ubuntu, blacklist the sb_edac driver by adding the line "blacklist sb_edac" to
    /etc/modprobe.d/blacklist.conf
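
    For anyone scripting this across several servers, here is a minimal Python sketch of the blacklist step. It assumes the usual modprobe.d syntax of one "blacklist <module>" line per module. The function name ensure_blacklisted is made up for this example and is not part of any tool. It only builds the new file contents, so writing them back to /etc/modprobe.d/blacklist.conf (as root) is left to the caller:

```python
def ensure_blacklisted(conf_text, module):
    """Return conf_text with a 'blacklist <module>' line, adding one if absent."""
    for line in conf_text.splitlines():
        # Strip any trailing comment, then compare the remaining words,
        # so extra whitespace or a comment does not hide an existing entry.
        if line.split("#", 1)[0].split() == ["blacklist", module]:
            return conf_text
    # Make sure the existing content ends with a newline before appending.
    if conf_text and not conf_text.endswith("\n"):
        conf_text += "\n"
    return conf_text + f"blacklist {module}\n"
```

    If the module is being loaded from the initramfs, regenerating it with update-initramfs -u and rebooting may also be needed before the change takes effect.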

    A quick Google shows sb_edac is the error-reporting (EDAC) driver for the Sandy Bridge memory controller, which is in use here because the E5-2600 Xeons are Sandy Bridge parts. If it is mishandling things on this hardware, then no wonder it caused problems.

    Oh, and as macptcom says, Dell officially supports RedHat and SuSE, so I'm certain they will have the correct drivers as standard.

    I am also unfortunately tied to Debian-based distros, as months of development time have gone into my work, and switching over to CentOS would put me so far back that my boss would probably kill this project.
    Last edited by norris900; September 13th, 2012 at 06:21 PM.

  9. #9
    Join Date
    Sep 2012
    Beans
    5

    Re: 12.04.1 on Dell R720?

    Thank you norris900 for the information here and in PM.

    My 2 servers were for a customer order that had to ship, so now we'll have to deal with the fixes in the field. I have one more server on order and hopefully it shows the same symptoms so I can prove the fixes work for me.

    Did you also see the network interfaces drop? I was seeing that before and more frequently than the CPU iERRs.

  10. #10
    Join Date
    Sep 2012
    Beans
    4

    Re: 12.04.1 on Dell R720?

    Hi all,

    I have also had several problems with the R720 and R620 and Ubuntu 12.04.1. We have had two motherboards and one CPU replaced. Probably a waste, but that was what Dell recommended.

    I just tried reinstalling one server over PXE, and it didn't boot up. I don't know what was on the screen because the server is in our datacenter (and we didn't buy the iDRAC Enterprise). Then I tried applying the blacklist for sb_edac, and voila! It boots now and has done so a couple of times (except for when I completely wrecked my bonding interfaces with a broken configuration).

    I hope the performance won't be affected, but we are already a couple of weeks late with our project and didn't want to switch to RHEL or CentOS since our whole Chef setup is configured for Ubuntu.

    Update: I have not noticed any interface drops, neither on our "integrated" 4x1Gb Intel cards nor on our PCIe-connected Intel X520 card.
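
    To confirm after a reboot that the blacklist actually took effect, the loaded module list can be checked. A small Python sketch, assuming the standard /proc/modules format where the module name is the first field on each line; loaded_modules and is_loaded are hypothetical names used only for this example:

```python
def loaded_modules(proc_modules_text):
    """Return the set of module names from text in /proc/modules format."""
    return {
        line.split()[0]
        for line in proc_modules_text.splitlines()
        if line.strip()
    }


def is_loaded(module, proc_modules_path="/proc/modules"):
    """Return True if the named module appears in the kernel's module list."""
    with open(proc_modules_path) as fh:
        return module in loaded_modules(fh.read())
```

    After the blacklist is applied and the server rebooted, is_loaded("sb_edac") would be expected to return False.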
    Last edited by deltaprojects; September 14th, 2012 at 01:34 PM.
