OpenStack Compute Services are complex, and being able to diagnose faults is an essential part of ensuring the smooth running of the services. Fortunately, OpenStack Compute provides some tools to help with this process, along with the tools provided by Ubuntu to help identify issues.
Troubleshooting OpenStack Compute services can be a complex issue, but working through problems methodically and logically will help you reach a satisfactory outcome. Carry out the following suggested steps when encountering the different problems presented.
sysctl -A | grep ip_forward
net.ipv4.ip_forward=1
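The check above can also be scripted. The following is a minimal sketch, not part of the original recipe: the function name `ip_forward_enabled` is hypothetical, and it assumes you pass it the text printed by the `sysctl` command. It matches the key exactly, because on newer kernels a grep for `ip_forward` can also return related keys such as `net.ipv4.ip_forward_use_pmtu`.

```python
def ip_forward_enabled(sysctl_output: str) -> bool:
    """Return True if net.ipv4.ip_forward is reported as 1.

    Hypothetical helper: expects the text printed by
    `sysctl -A | grep ip_forward`.
    """
    for line in sysctl_output.splitlines():
        # sysctl prints "key = value"; config files often use "key=value".
        key, sep, value = line.partition("=")
        if sep and key.strip() == "net.ipv4.ip_forward":
            return value.strip() == "1"
    # The key did not appear in the text at all.
    return False


# Illustrative input only, not captured from a real host.
sample = "net.ipv4.ip_forward = 1\nnet.ipv4.ip_forward_use_pmtu = 0\n"
print(ip_forward_enabled(sample))
```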
nova console-log INSTANCE_ID
For example:
nova console-log ee0cb5ca-281f-43e9-bb40-42ffddcb09cd
The console logs are owned by root, so only an administrator can do this. They are placed at: /var/lib/nova/instances//console.log.
If an instance cannot reach the metadata service to download the extra information supplied through the instance metadata, the instance can be up while you are still unable to log in, because the SSH key information is injected using this method.
Viewing the console log will show output like in the following screenshot:
If you are not using Neutron, ensure the following:
sudo iptables -L -n -t nat
We should see a line in the output like in the following screenshot:
ps -ef | grep dnsmasq
This will bring back two process entries, the parent dnsmasq process and a spawned child (verify by the PIDs). If there are any other instances of dnsmasq running, kill the dnsmasq processes. When killed, restart nova-network, which will spawn dnsmasq again without any conflicting processes.
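The parent and child check described above could be automated roughly as follows. This is a sketch under stated assumptions: both function names are hypothetical, the input is the text of `ps -ef` in its usual column layout (UID, PID, PPID, C, STIME, TTY, TIME, CMD), and the sample process list is invented for illustration. The `grep dnsmasq` process itself is ignored because its executable is `grep`.

```python
from typing import List, Tuple


def dnsmasq_processes(ps_output: str) -> List[Tuple[int, int]]:
    """Return (pid, ppid) pairs for dnsmasq processes found in `ps -ef` text."""
    found = []
    for line in ps_output.splitlines():
        fields = line.split(None, 7)  # UID PID PPID C STIME TTY TIME CMD
        if len(fields) < 8 or not fields[1].isdigit() or not fields[2].isdigit():
            continue  # header line or something we cannot parse
        # Compare the executable name only, so "/usr/sbin/dnsmasq" matches
        # but "grep dnsmasq" does not.
        executable = fields[7].split()[0].rsplit("/", 1)[-1]
        if executable == "dnsmasq":
            found.append((int(fields[1]), int(fields[2])))
    return found


def is_single_parent_child_pair(processes: List[Tuple[int, int]]) -> bool:
    """True when there are exactly two processes and one is the other's parent."""
    if len(processes) != 2:
        return False
    (pid_a, ppid_a), (pid_b, ppid_b) = processes
    return ppid_a == pid_b or ppid_b == pid_a


# Invented sample output, for illustration only.
SAMPLE = """\
UID        PID  PPID  C STIME TTY          TIME CMD
nobody    2001     1  0 10:00 ?        00:00:00 /usr/sbin/dnsmasq --strict-order
root      2002  2001  0 10:00 ?        00:00:00 /usr/sbin/dnsmasq --strict-order
user      2100  1999  0 10:05 pts/0    00:00:00 grep dnsmasq
"""
print(is_single_parent_child_pair(dnsmasq_processes(SAMPLE)))
```

If the second function returns False, there are either missing or conflicting dnsmasq processes, which is the situation the text above says to clean up before restarting nova-network.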
If you are using Neutron:
The first place to look is in the /var/log/quantum/metadata_agent.log on the Network host. Here you may see Python stack traces that could indicate a service isn’t running correctly. A connection refused message may appear here suggesting the metadata agent running on the Network host is unable to talk to the Metadata service on the Controller host via the Metadata Proxy service (also running on the Network host).
The metadata service runs on port 8775 on our Controller host, so checking that it is running involves checking that the port is open and that the metadata service is what is listening on it. To do this on the Controller host, run the following:
sudo netstat -antp | grep 8775
This will bring back the following output if everything is OK:
If nothing is returned, check that the nova-api service is running and if not, start it.
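As a complement to netstat, a TCP connection attempt can confirm that something is listening on the port. The sketch below is an alternative technique, not the method from the text: `port_is_open` is a hypothetical helper, and the address 127.0.0.1 assumes the script runs on the Controller host itself. It only shows that the port accepts connections, not which service owns it.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        # create_connection raises OSError on refusal, timeout or DNS failure.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    # 8775 is the metadata service port named in the text above.
    print(port_is_open("127.0.0.1", 8775))
```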
Sometimes, a little patience is needed before assuming the instance has not booted, because the image is copied across the network to a node that has not seen the image before. At other times, though, if the instance has been stuck in booting or a similar state for longer than normal, it indicates a problem. The first place to look for errors is in the logs. A quick way of doing this is to issue the following command from the controller server:
sudo nova-manage logs errors
A common error you will see here relates to AMQP being unreachable. Generally, these errors can be ignored unless the timestamps show that they are still appearing now. You tend to see a number of these messages from when the services first started up, so look at the timestamp before reaching conclusions.
This command brings back any log line with ERROR as the log level, but you will need to view the logs in more detail to get a clearer picture.
A key log file, when troubleshooting instances that are not booting properly, will be available on the controller host at /var/log/nova/nova-scheduler.log. This file tends to produce the reason why an instance is stuck in Building state. Another file to view further information will be on the compute host at /var/log/nova/nova-compute.log. Look here at the time you launch the instance. In a busy environment, you will want to tail the log file and parse for the instance ID.
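Picking out the lines for one instance, as suggested above, might look like the following sketch. The function name `lines_for_instance` and the sample log lines are hypothetical, and the sketch assumes the log level appears as a space-separated word on each line. It scans lines it is given once rather than following the file the way `tail -f` does.

```python
from typing import Iterable, Iterator


def lines_for_instance(lines: Iterable[str], instance_id: str,
                       level: str = "") -> Iterator[str]:
    """Yield log lines that mention instance_id, optionally at one log level."""
    for line in lines:
        if instance_id not in line:
            continue
        # Assumes the level is a separate word, e.g. "... ERROR nova.compute ...".
        if level and f" {level} " not in line:
            continue
        yield line.rstrip("\n")


# Invented log lines, for illustration only.
sample_log = [
    "2013-09-21 10:15:01.123 1234 INFO nova.compute.manager [instance: aaa] Starting instance\n",
    "2013-09-21 10:15:02.456 1234 ERROR nova.compute.manager [instance: aaa] Build failed\n",
    "2013-09-21 10:15:03.789 1234 ERROR nova.compute.manager [instance: bbb] Build failed\n",
]
for match in lines_for_instance(sample_log, "aaa", level="ERROR"):
    print(match)
```

Because an open file is an iterable of lines, the same function can be given `open("/var/log/nova/nova-compute.log")` on the compute host in place of the sample list.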
Check /var/log/nova/nova-network.log (for Nova Network) and /var/log/quantum/*.log (for Neutron), for any reason why instances aren’t being assigned IP addresses. It could be issues around DHCP preventing address allocation or quotas being reached.
The majority of the OpenStack services are web services, meaning the responses from the services are well defined.
40X: The service is up but is responding to a request that failed because of a user error. For example, a 401 is an authentication failure, so check the credentials used when accessing the service.
500: These errors mean that a service this one connects to is unavailable, or that it returned a response the service could not handle, causing the request to fail. Common problems here are services that have not started properly, so check for running services.
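The two families of status codes above map naturally onto a small lookup. The following is a sketch only: `describe_status` is a hypothetical helper and the wording of its messages is illustrative. The 2xx success branch is not discussed in the text and is included for completeness.

```python
def describe_status(status: int) -> str:
    """Map an HTTP status code to a first troubleshooting hint."""
    if 200 <= status < 300:
        return "success"
    if status == 401:
        return "authentication failure: check the credentials used"
    if 400 <= status < 500:
        return "client-side error: the service is up, check the request"
    if 500 <= status < 600:
        return "server-side error: check that the services are running"
    return "unclassified"


for code in (401, 404, 500):
    print(code, describe_status(code))
```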
If all avenues have been exhausted when troubleshooting your environment, reach out to the community, using the mailing list or IRC, where there is a raft of people willing to offer their time and assistance. See the Getting help from the community recipe at the end of this topic, for more information.
From the OpenStack controller node, you can execute the following command to get a list of the running instances in the environment:
sudo nova-manage vm list
To view all instances across all tenants, execute the following command as a user with an admin role:
nova list --all-tenants
These commands are useful for identifying any failed instances and the hosts on which they are running. You can then investigate further.
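To pick failed instances out of the listing automatically, the table printed by the CLI can be parsed by its header row. This is a sketch under assumptions: both function names are hypothetical, the sample table is invented, and it assumes the output is a bordered table with a `Status` column whose value is `ERROR` for failed instances. Column names and order can differ between releases, which is why the rows are keyed by header name.

```python
from typing import Dict, List


def parse_cli_table(text: str) -> List[Dict[str, str]]:
    """Parse a '+---+' bordered table into a list of row dictionaries."""
    rows: List[Dict[str, str]] = []
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not (line.startswith("|") and line.endswith("|")):
            continue  # skip borders and blank lines
        cells = [cell.strip() for cell in line[1:-1].split("|")]
        if header is None:
            header = cells  # the first data line holds the column names
        else:
            rows.append(dict(zip(header, cells)))
    return rows


def instances_in_error(text: str) -> List[Dict[str, str]]:
    """Return the rows whose Status column reads ERROR."""
    return [row for row in parse_cli_table(text) if row.get("Status") == "ERROR"]


# Invented sample listing, for illustration only.
SAMPLE = """\
+------+-------+--------+------------------+
| ID   | Name  | Status | Networks         |
+------+-------+--------+------------------+
| id-1 | web01 | ACTIVE | private=10.0.0.2 |
| id-2 | web02 | ERROR  |                  |
+------+-------+--------+------------------+
"""
for row in instances_in_error(SAMPLE):
    print(row["ID"], row["Name"])
```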
Troubleshooting OpenStack Compute problems can be quite complex, but looking in the right places can help solve some of the more common problems. Unfortunately, like troubleshooting any computer system, there isn’t a single command that can help identify all the problems that you may encounter, but OpenStack provides some tools to help you identify some problems. Having an understanding of managing servers and networks will help troubleshoot a distributed cloud environment such as OpenStack.
There’s more than one place to look when identifying issues, as they can stem from anything from the environment to the instances themselves. Working through the problems methodically, though, will help lead you to a resolution.
Ravindra Savaram is a Technical Lead at Mindmajix.com. His passion lies in writing articles on the most popular IT platforms including Machine learning, DevOps, Data Science, Artificial Intelligence, RPA, Deep Learning, and so on. You can stay up to date on all these technologies by following him on LinkedIn and Twitter.