Troubleshooting OpenStack Compute services
OpenStack Compute services are complex, and being able to diagnose faults is an essential part of ensuring the smooth running of the services. Fortunately, OpenStack Compute provides some tools to help with this process, along with the tools provided by Ubuntu to help identify issues.
How to achieve it…
Troubleshooting OpenStack Compute services can be a complex issue, but working through problems methodically and logically will help you reach a satisfactory outcome. Carry out the following suggested steps when encountering the different problems presented.
Steps for when you cannot ping or SSH to an instance
- While launching instances, we specify a security group. If none is specified, a security group is named default. These mandatory security groups ensure security is enabled by default in our cloud environment, and as such, we must explicitly state that we require the ability to ping our instances and SSH to them. For such a basic activity, it is common to add these abilities to the default security group.
- Network issues may prevent us from accessing our cloud instances. First, check that the compute instances are able to forward packets from the public interface to the bridged interface. Use the following command for the same:
sysctl -A | grep ip_forward
- ipv4.ip_forward should be set to 1. If it isn’t, check that /etc/sysctl.conf has the following option uncommented. Use the following command for it:
- Then, run the following command to pick up the change:
sudo sysctl -p
- Other network issues could be routing issues. Check that we can communicate with the OpenStack Compute nodes from our client and that any routing to get to these instances has the correct entries.
- We may have a conflict with IPv6, if IPv6 isn’t required. If this is the case, try adding –use_ipv6=false to your /etc/nova/nova.conf file, and restart the nova-compute and nova-network services. We may also need to disable IPv6 in the operating system, which can be achieved using something like the following line in /etc/modprobe.d/ipv6.conf:
install ipv6 /bin/true
- If using OpenStack Neutron, check the status of the neutron services on the host and the correct IP namespace is being used (see Troubleshooting OpenStack Networking).
- Reboot your host.
Methods for viewing the Instance Console log
- When using the command line, issue the following commands:
nova console-log INSTANCE_ID
nova console-log ee0cb5ca-281f-43e9-bb40-42ffddcb09cd
- When using Horizon, carry out the following steps:
- Navigate to the list of instance and select an instance.
- You will be taken to an Overview. Along the top of the Overview screen is a Log tab. This is the console log for the instance.
- When viewing the logs directly on a nova-compute host, look for the following file:
The console logs are owned by root, so only an administrator can do this. They are placed at: var/lib/nova/instances/<instance_id>/console.log.
Instance fails to download meta information
If an instance fails to communicate to download the extra information that can be supplied to the instance meta-data, we can end up in a situation where the instance is up, but you’re unable to log in, as the SSH key information is injected using this method.
Viewing the console log will show output like in the following screenshot:
If you are not using Neutron, ensure the following:
- nova-api is running on the Controller host (in a multi_host environment, ensure there’s a nova-api-metadata and a nova-network package installed and running on the Compute host).
- Perform the following iptables check on the Compute node:
sudo iptables -L -n -t nat
We should see a line in the output like in the following screenshot:
- If not, restart your nova-network services and check again.
- Sometimes there are multiple copies of dnsmasq running, which can cause this issue. Ensure that there is only one instance of dnsmasq running:
ps -ef | grep dnsmasq
This will bring back two process entries, the parent dnsmasq process and a spawned child (verify by the PIDs). If there are any other instances of dnsmasq running, kill the dnsmasq processes. When killed, restart nova-network, which will spawn dnsmasq again without any conflicting processes.
If you are using Neutron:
The first place to look is in the /var/log/quantum/metadata_agent.log on the Network host. Here you may see Python stack traces that could indicate a service isn’t running correctly. A connection refused message may appear here suggesting the metadata agent running on the Network host is unable to talk to the Metadata service on the Controller host via the Metadata Proxy service (also running on the Network host).
The metadata service runs on port 8775 on our Controller host, so checking that in running involves checking that the port is open and it’s running the metadata service. To do this on the Controller host, run the following:
sudo netstat -antp | grep 8775
This will bring back the following output if everything is OK:
tcp 0 0 0.0.0.0:8775 0.0.0.0:* LISTEN
If nothing is returned, check that the nova-api service is running and if not, start it.
Instance launches; stuck at Building or Pending
Sometimes, a little patience is needed before assuming the instance has not booted, because the image is copied across the network to a node that has not seen the image before. At other times, though, if the instance has been stuck in booting or a similar state for longer than normal, it indicates a problem. The first place to look will be for errors in the logs. A quick way of doing this is from the controller server and by issuing the following command:
sudo nova-manage logs errors
A common error that is usually present is usually related to AMQP being unreachable. Generally, these errors can be ignored unless, that is, you check the time stamp and these errors are currently appearing. You tend to see a number of these messages related to when the services first started up, so look at the timestamp before reaching conclusions.
This command brings back any log line with the ERROR as log level, but you will need to view the logs in more detail to get a clearer picture.
A key log file, when troubleshooting instances that are not booting properly, will be available on the controller host at /var/log/nova/nova-scheduler.log. This file tends to produce the reason why an instance is stuck in Building state. Another file to view further information will be on the compute host at /var/log/nova/nova-compute.log. Look here at the time you launch the instance. In a busy environment, you will want to tail the log file and parse for the instance ID.
Check /var/log/nova/nova-network.log (for Nova Network) and /var/log/quantum/*.log (for Neutron), for any reason why instances aren’t being assigned IP addresses. It could be issues around DHCP preventing address allocation or quotas being reached.
Error codes such as 401, 403, 500
The majority of the OpenStack services are web services, meaning the responses from the services are well defined.
40X: This refers to a service that is up,but responding to an event that is produced by some user error. For example, a 401 is, an authentication failure, so check the credentials used when accessing the service.
500: These errors mean a connecting service is unavailable or has caused an error that has caused the service to interpret a response to cause a failure. Common problems here are services that have not started properly, so check for running services.
If all avenues have been exhausted when troubleshooting your environment, reach out to the community, using the mailing list or IRC, where there is a raft of people willing to offer their time and assistance. See the Getting help from the community recipe at the end of this topic, for more information.
Listing all instances across all hosts
From the OpenStack controller node, you can execute the following command to get a list of the running instances in the environment:
sudo nova-manage vm list
To view all instances across all tenants, as a user with an admin role executes the following command:
nova list --all-tenants
These commands are useful in identifying any failed instances and the host on which it is running. You can then investigate further.
How it works…
Troubleshooting OpenStack Compute problems can be quite complex, but looking in the right places can help solve some of the more common problems. Unfortunately, like troubleshooting any computer system, there isn’t a single command that can help identify all the problems that you may encounter, but OpenStack provides some tools to help you identify some problems. Having an understanding of managing servers and networks will help troubleshoot a distributed cloud environment such as OpenStack.
There’s more than one place where you can go to identify the issues, as they can stem from the environment to the instances themselves. Methodically working your way through the problems though will help lead you to a resolution.