Nutanix : Non-disruptive Physical Memory Upgrade

Background

After a few years of service, your beloved cluster starts to show its age. There is no better way to refresh it than to give VMs and services a bit more room by upgrading the nodes' memory. I did that a few days ago, and now the cluster feels brand new.

Let's do this!

For the sake of compliance and risk management, I first want to point you to the official Nutanix procedure, so you know the authoritative reference behind the steps below. The official procedure is located here.

It is important to mention at this stage that the procedure below applies to AHV environments. If you are running either VMware or Hyper-V, there are some additional steps described in the official Nutanix documentation.

My cluster is a five-year-old 1050 with 3 nodes, originally shipped with 128 GB of RAM. It was becoming slow and I wanted to test Calm, which is out of the question with so little memory. I decided to upgrade it to 256 GB per node.
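To put numbers on the upgrade, here is a small worked calculation. It is only a sketch: it assumes the original 128 GB figure is per node (the post does not say so explicitly), and it ignores the memory reserved by the CVMs and the hypervisor. The function names are made up for illustration.

```python
def cluster_memory_gb(nodes: int, gb_per_node: int) -> int:
    """Total physical memory across the cluster, in GB."""
    return nodes * gb_per_node


def memory_with_one_node_down_gb(nodes: int, gb_per_node: int) -> int:
    """Physical memory still available while one node is in maintenance mode."""
    return (nodes - 1) * gb_per_node


NODES = 3
before = cluster_memory_gb(NODES, 128)  # assumption: 128 GB per node
after = cluster_memory_gb(NODES, 256)   # target from the post: 256 GB per node

print(f"before: {before} GB, after: {after} GB, gain: {after - before} GB")
print(f"left while the first node is down: {memory_with_one_node_down_gb(NODES, 128)} GB")
```

The second figure is the one that matters for a non-disruptive upgrade: while the first node is out, all running VMs have to fit on the two remaining nodes.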


This is the status of the cluster before upgrade

High level procedure

  • Identify the required memory modules to add
  • On each node, one node at a time:
    • put the node into maintenance mode, which evacuates its VMs to the other nodes
    • power off the CVM
    • power off the node
    • remove the node from the chassis
    • install the memory modules
    • re-insert the node in the chassis
    • power up the node
    • exit maintenance mode
    • wait for the VMs to relocate
    • start the next node (loop through all nodes in the cluster)
  • Confirm your cluster memory has been increased

Ideally, run all NCC checks (ncc health_checks run_all) just before and just after the upgrade, and compare the results to confirm that the cluster is in good shape.
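The per-node loop above can be written down as a checklist generator. This is a rough sketch that only prints the steps and executes nothing. The commands are the ones shown later in this post; the function names, the "where" labels and the host addresses are made up for illustration.

```python
from typing import List, Tuple

Step = Tuple[str, str]  # (where to run it, what to do)


def plan_node_upgrade(host_ip: str) -> List[Step]:
    """Ordered steps for one node, using the commands shown in this post."""
    return [
        ("CVM", f"acli host.enter_maintenance_mode {host_ip} wait=true"),
        ("CVM on this node", "cvm_shutdown -P now"),
        ("AHV host", "shutdown -h now"),
        ("manual", "remove the node, install the memory modules, re-insert, power up"),
        ("AHV host", "virsh list"),
        ("CVM on this node", "cluster status"),
        ("CVM", f"acli host.exit_maintenance_mode {host_ip}"),
        ("manual", "wait for the VMs to move back before starting the next node"),
    ]


def plan_cluster_upgrade(host_ips: List[str]) -> List[Tuple[str, Step]]:
    """One node at a time: every step of a node comes before the next node."""
    return [(ip, step) for ip in host_ips for step in plan_node_upgrade(ip)]


# Hypothetical host addresses, for illustration only.
for ip, (where, action) in plan_cluster_upgrade(["10.0.0.11", "10.0.0.12", "10.0.0.13"]):
    print(f"[{ip}] {where}: {action}")
```

Keeping the plan as plain data makes it easy to review before touching the cluster, which is the point: nothing here should be run blindly.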

Set the node in maintenance mode

nutanix@NTNX-B-CVM:192.168.x.x:~$ acli host.enter_maintenance_mode 192.168.x.x wait=true
EnterMaintenanceMode: pending
EnterMaintenanceMode: complete

At this stage, no user VMs are running any more on the host we just put into maintenance mode. Only its CVM is still up.

Power off the CVM of the host we are upgrading

nutanix@NTNX-B-CVM:192.168.x.x:~$ cvm_shutdown -P now
2020-04-01 09:52:22 INFO zookeeper_session.py:143 cvm_shutdown is attempting to connect to Zookeeper
2020-04-01 09:52:22 INFO lcm_genesis.py:217 Rpc to [localhost] for LCM [LcmFramework.is_lcm_operation_in_progress] is successful
2020-04-01 09:52:22 INFO cvm_shutdown:157 No upgrade was found to be in progress on the cluster
2020-04-01 09:52:22 INFO cvm_shutdown:84 Acquired shutdown token successfully
2020-04-01 09:52:22 INFO cvm_shutdown:104 Validating command arguments.
2020-04-01 09:52:22 INFO cvm_shutdown:107 Executing cmd: sudo shutdown -k -P now

2020-04-01 09:52:23 INFO cvm_shutdown:92 Setting up storage traffic forwarding
2020-04-01 09:52:23 WARNING genesis_utils.py:118 Deprecated: use util.cluster.info.get_factory_config() instead
2020-04-01 09:52:23 INFO genesis_utils.py:2825 Verifying if route is set for 192.168.x.x
2020-04-01 09:52:23 INFO genesis_utils.py:2830 HA Route is not yet set for 192.168.x.x
2020-04-01 09:52:26 INFO genesis_utils.py:2830 HA Route is not yet set for 192.168.x.x
Write failed: Broken pipe

Note: shutting down this CVM does not affect running workloads. As the log shows, storage traffic is forwarded, so the workloads stay accessible while one of the CVMs is down.

Shutting down the host

[root@xxxx-KVM-B ~]# shutdown -h now

Broadcast message from root@xxxxx-KVM-B
        (/dev/pts/0) at 7:52 ...

At this stage, you can remove the node from the chassis and start populating the memory slots.


Sexy 24 DIMM modules (384 GB RAM)


This is the node, fully populated. The next time I want to upgrade, I will have to remove all the DIMMs and replace them with larger ones. But that will not happen.

Once the node has restarted, SSH into it to check whether the CVM has started:

[root@xxxx-KVM-A ~]# virsh list 
 Id    Name                           State
----------------------------------------------------
 1     xxxx-KVM-A-CVM                 running


Now we can try to SSH into the CVM. If that works, we can check the cluster status. If it does not, you can try to start the CVM manually with this command: virsh start cvm_name.
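If you capture the virsh list output in a script, the "is the CVM running?" check can be sketched as follows. This is a minimal example, not a Nutanix tool: the function names are made up, and it assumes the CVM domain name ends in "CVM", as it does in the output above.

```python
from typing import Dict


def parse_virsh_list(output: str) -> Dict[str, str]:
    """Map each domain name to its state from `virsh list` output."""
    states = {}
    for line in output.splitlines():
        # Columns are Id, Name, State; the state itself may contain spaces.
        parts = line.split(None, 2)
        if len(parts) == 3 and (parts[0].isdigit() or parts[0] == "-"):
            states[parts[1]] = parts[2].strip()
    return states


def cvm_is_running(output: str) -> bool:
    """True if a domain whose name ends in 'CVM' is in the 'running' state."""
    return any(
        name.endswith("CVM") and state == "running"
        for name, state in parse_virsh_list(output).items()
    )


sample = """\
 Id    Name                           State
----------------------------------------------------
 1     xxxx-KVM-A-CVM                 running
"""
print(cvm_is_running(sample))
```

If the function returns False, that is the case where starting the CVM manually with virsh start is worth trying.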

Checking cluster status 

nutanix@xxxxxxx-B-CVM:192.168.x.x:~$ cluster status
2020-04-01 10:03:49 INFO zookeeper_session.py:143 cluster is attempting to connect to Zookeeper
2020-04-01 10:03:49 INFO cluster:2712 Executing action status on The state of the cluster: start
Lockdown mode: Disabled

[...]
        CVM: 192.168.x.x Up
                                Zeus   UP       [5069, 5106, 5107, 5112, 5121, 5147]
                           Scavenger   UP       [6568, 6599, 6600, 6601]
                       SSLTerminator   UP       [8879, 8975, 8976, 8977]
                      SecureFileSync   UP       [8882, 8920, 8921, 8922]
                              Medusa   UP       [9519, 9558, 9559, 9563, 9669]
                  DynamicRingChanger   UP       [9780, 9832, 9833, 9884]
                              Pithos   UP       [9784, 9852, 9853, 9873]
                              Mantle   UP       [9789, 9870, 9871, 9894]
                            Stargate   UP       [11305, 11516, 11517, 11758, 11763]
                          InsightsDB   UP       [13255, 13422, 13423, 13549]
                InsightsDataTransfer   UP       [13319, 13477, 13478, 13542, 13543, 13545, 13546]
                               Ergon   UP       [13363, 13824, 13825, 13826]
                             Cerebro   UP       [13452, 13583, 13584, 13818]
                             Chronos   UP       [13499, 13695, 13696, 13813]
                             Curator   UP       [13575, 13737, 13738, 13898]
                              Athena   UP       [13625, 14014, 14015, 14016]
                               Prism   UP       [14748, 14901, 14902, 15052, 15055, 15063, 15142, 15143, 15144]
                                 CIM   UP       [14847, 15018, 15019, 15100]
                        AlertManager   UP       [14886, 15332, 15333, 15471]
                            Arithmos   UP       [14936, 15110, 15111, 15335]
                             Catalog   UP       [14965, 15472, 15473, 15474]
                           Acropolis   UP       [15161, 15378, 15379, 15380]
                               Uhura   UP       [15290, 15496, 15497, 15498]
                                Snmp   UP       [15485, 15591, 15592, 15594]
                    SysStatCollector   UP       [15545, 15659, 15660, 15662]
                   NutanixGuestTools DOWN       []
                          MinervaCVM DOWN       []
                       ClusterConfig DOWN       []
                             Mercury DOWN       []
                         APLOSEngine DOWN       []
                               APLOS DOWN       []
                               Lazan DOWN       []
                              Delphi DOWN       []
                                Flow DOWN       []
                             Anduril DOWN       []
                               XTrim DOWN       []
                       ClusterHealth DOWN       []

[...]
2020-04-01 10:03:52 INFO cluster:2863 Success!

In the output above, not all services have started yet. Once every service is up and running, we can exit maintenance mode.

nutanix@xxxxxxx:192.168.x.x:~$ acli host.exit_maintenance_mode 192.168.x.x
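Before running the exit command, the "is every service up?" check can be done on captured cluster status output instead of by eye. The following is a minimal sketch. The function name and the regular expression are my own, and they assume the line layout shown in the output above (service name, UP or DOWN, then a list of PIDs in brackets).

```python
import re
from typing import List

# Matches service lines such as "   Stargate   UP       [11305, 11516]"
SERVICE_LINE = re.compile(r"^\s*(\S+)\s+(UP|DOWN)\s+\[.*\]\s*$")


def down_services(cluster_status_output: str) -> List[str]:
    """Return the names of the services reported as DOWN."""
    down = []
    for line in cluster_status_output.splitlines():
        match = SERVICE_LINE.match(line)
        if match and match.group(2) == "DOWN":
            down.append(match.group(1))
    return down


sample = """\
        CVM: 192.168.x.x Up
                                Zeus   UP       [5069, 5106]
                            Stargate   UP       [11305, 11516]
                   NutanixGuestTools DOWN       []
                       ClusterHealth DOWN       []
"""
print(down_services(sample))  # ['NutanixGuestTools', 'ClusterHealth']
```

An empty list means every listed service is UP. Note that the full output repeats the service list once per CVM, so a name can appear more than once in the result.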

Meanwhile, you can watch in Prism Element as the VMs are moved back to their original host.



This is the status of the cluster after upgrade

Immediately after this, I upgraded my CVMs to 32 GB of RAM and enabled erasure coding and capacity deduplication.

Now, I can use Calm and other features ;)


Hope this helps...

