Cisco UCS Log Fullness due to ECC Memory Errors

Greetings, everyone! I recently worked with a customer whose vCenter Server was reporting that the Cisco UCS System Event Log (SEL) on one of their hosts was filling up.

Looking at the host’s SEL Logs tab in UCS Manager, we could see that the SEL had filled up due to a significant number of ECC errors on a particular set of DIMMs. Typically, we could just clear the SEL and move on, but I’ve found that the following steps not only clear the SEL but also reset the ECC memory error state, which helps determine whether a DIMM truly is flaky.

  1. Open your SSH client of choice and connect to the Cisco UCS Manager.


  2. Log in to UCS Manager. In this particular environment, the customer had to log in using their domain credentials in this format:
    ucs-DOMAIN\USERID


  3. Run this set of commands to connect to the particular blade (if applicable), reset the memory errors, and clear the SEL.

    In this example, connect to Chassis #3, Blade #2:
    scope server 3/2

    Then, reset all ECC memory errors being reported in the SEL:
    reset-all-memory-errors

    Commit the changes to UCS Manager:
    commit-buffer

    The next step is to reset or clear the SEL:
    clear sel

    Again, commit the changes to UCS Manager:
    commit-buffer

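  4. I believe the last step is optional, but in my experience, it didn’t hurt. Reset the CIMC, just to be safe. While still in the server scope, enter the CIMC scope first, because a reset issued at the server scope targets the server itself rather than the CIMC:
    scope cimc
    reset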

    As usual, commit the changes:
    commit-buffer

    Doing so will drop any connection to the CIMC for that server (including the SSH session that was established earlier in this post).


  5. Ping or try to connect to the CIMC address after a few minutes to confirm that connectivity and remote management have been restored.
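
If you would rather not keep retrying by hand, the connectivity check in the last step can be scripted. The following is only a minimal Python sketch, not part of any Cisco tooling: the function name wait_for_port is made up for illustration, and the address and port you pass in are assumptions you will need to adjust for your own environment. It simply retries a TCP connection until one succeeds or a time limit runs out.

```python
import socket
import time


def wait_for_port(host, port, timeout_s=600.0, interval_s=10.0, connect_timeout_s=3.0):
    """Return True once a TCP connection to host:port succeeds.

    Returns False if no attempt succeeds within roughly timeout_s seconds.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            # A completed TCP handshake is treated as "reachable".
            with socket.create_connection((host, port), timeout=connect_timeout_s):
                return True
        except OSError:
            # Refused, timed out, or unreachable: fall through and retry.
            pass
        # Give up if the next attempt would start after the deadline.
        if time.monotonic() + interval_s > deadline:
            return False
        time.sleep(interval_s)
```

For example, wait_for_port("192.0.2.10", 443) would retry for up to ten minutes. Both values are placeholders: 192.0.2.10 is a documentation address standing in for your CIMC management IP, and 443 is only a guess at a port that answers in your setup. A successful TCP connection only shows that the address is answering again, so it is still worth opening a KVM or management session to confirm.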

And that’s pretty much all there is to it! Hopefully you found this post helpful. As always, thanks for stopping by!