To troubleshoot communication between the controller nodes and execution nodes in Ansible Automation Platform, you can follow these steps:
- Check the network connectivity and firewall settings between the nodes. You can use tools like ping, traceroute, telnet, nc, etc. to test the network reachability and latency. You can also use the ansible -m ping command to test the Ansible connectivity between the nodes.
- Check the receptor configuration and status on each node. You can use the receptorctl command to view and manage the receptor mesh network. For example, the receptorctl status command lists the nodes, connections, and work types in the mesh.
- Check the logs and metrics of the receptor service on each node. You can use tools like journalctl, tail, grep, etc. to view and filter the logs.
- Check the Ansible Automation Platform web UI and API for any errors or warnings related to the node registration, grouping, or health.
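As a minimal sketch of the first check, the following Python snippet tests whether a TCP port on a peer accepts connections. The function name `tcp_port_open` and the default timeout are illustrative assumptions, not part of any Ansible tooling.

```python
import socket


def tcp_port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection resolves the name and completes the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False
```

For example, `tcp_port_open("aap02.example.net", 27199)` probes the Receptor port used in this guide. A `False` result suggests a firewall rule, a routing problem, or a stopped service, but does not say which.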
Verify the Receptor
Receptor is an overlay networking layer that Ansible Automation Platform uses to communicate with execution nodes, that is, the nodes that run automation jobs on behalf of the controller. Execution nodes should not be confused with managed nodes, which are the hosts that automation targets. In the newer versions of Ansible Automation Platform, Receptor serves as the underlying communication backbone.
Here are some steps and commands you can use to troubleshoot Receptor communication between an Ansible Controller and execution nodes:
- Check Receptor Status. On the Ansible Controller, you can check the status of the Receptor using:
receptorctl --socket /var/run/awx-receptor/receptor.sock status
Example output:
Node ID: aap01.example.net
Version: 1.4.1
System CPU Count: 2
System Memory MiB: 7761
Connection Cost
aap02.example.net 1
aap03.example.net 1
aap01gcp.example.net 1
Known Node Known Connections
aap01.example.net aap02.example.net: 1 aap03.example.net: 1 aap01gcp.example.net: 1
aap02.example.net aap01.example.net: 1 aap03.example.net: 1 aap01gcp.example.net: 1
aap03.example.net aap01.example.net: 1 aap02.example.net: 1 aap01gcp.example.net: 1
aap01gcp.example.net aap01.example.net: 1 aap02.example.net: 1 aap03.example.net: 1
Route Via
aap02.example.net aap02.example.net
aap03.example.net aap03.example.net
aap01gcp.example.net aap01gcp.example.net
Node Service Type Last Seen Tags
aap01.example.net control StreamTLS 2023-09-13 14:26:18 {'type': 'Control Service'}
aap01gcp.example.net control StreamTLS 2023-09-13 14:25:54 {'type': 'Control Service'}
aap03.example.net control StreamTLS 2023-09-13 14:25:24 {'type': 'Control Service'}
aap02.example.net control StreamTLS 2023-09-13 14:25:54 {'type': 'Control Service'}
Node Secure Work Types
aap01.example.net local, kubernetes-runtime-auth, kubernetes-incluster-auth
aap01gcp.example.net ansible-runner
aap03.example.net local, kubernetes-runtime-auth, kubernetes-incluster-auth
aap02.example.net local, kubernetes-runtime-auth, kubernetes-incluster-auth
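When many nodes are involved, comparing the status output with the expected node list by eye becomes error-prone. The sketch below assumes that your receptorctl version can emit the status as JSON (for example through a `--json` option on the status command) and that the JSON contains a `RoutingTable` object keyed by node ID. Both are assumptions to verify against your installation, and the function name is hypothetical.

```python
import json
from typing import List


def missing_routes(status_json: str, expected_nodes: List[str]) -> List[str]:
    """Return the expected nodes that have no entry in the routing table."""
    status = json.loads(status_json)
    # Assumption: "RoutingTable" maps each reachable node to its next hop.
    routes = status.get("RoutingTable") or {}
    return [node for node in expected_nodes if node not in routes]
```

An empty result means every expected node has a route. Any node that is returned is a candidate for the log review and ping tests described in this section.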
- Review Receptor Logs. Logs can be found in the system journal. You can view them using:
journalctl -u receptor
- Verify mesh-level communication. From the Ansible Controller, you can ping the execution node with the receptorctl command to see whether it is reachable:
[root@aap01 receptor]# receptorctl --socket /var/run/awx-receptor/receptor.sock ping aap02.example.net
Reply from aap02.example.net in 811.135µs
Reply from aap02.example.net in 814.871µs
Reply from aap02.example.net in 852.096µs
Reply from aap02.example.net in 848.816µs
[root@aap01 receptor]# receptorctl --socket /var/run/awx-receptor/receptor.sock ping aap01gcp.example.net
Reply from aap01gcp.example.net in 4.143774ms
Reply from aap01gcp.example.net in 4.049415ms
Reply from aap01gcp.example.net in 7.643543ms
Reply from aap01gcp.example.net in 4.131193ms
Repeat the test from the execution node to the controller nodes:
[root@aap01gcp ~]# receptorctl --socket /var/run/awx-receptor/receptor.sock ping aap01.example.net
Reply from aap01.example.net in 3.998636ms
Reply from aap01.example.net in 4.07025ms
Reply from aap01.example.net in 4.053869ms
Reply from aap01.example.net in 4.43546ms
[root@aap01gcp ~]# receptorctl --socket /var/run/awx-receptor/receptor.sock ping aap02.example.net
Reply from aap02.example.net in 3.131143ms
Reply from aap02.example.net in 3.118211ms
Reply from aap02.example.net in 3.35466ms
Reply from aap02.example.net in 3.120776ms
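To compare latencies across nodes, the reply lines can be converted to a single unit. The following sketch parses lines in the "Reply from NODE in DURATION" form shown above and returns the round-trip times in milliseconds. It assumes the duration suffixes are limited to ns, µs (or us), ms, and s; the function name is illustrative.

```python
import re
from typing import List

# Multipliers that convert each unit suffix to milliseconds.
# Both the micro sign (U+00B5) and the Greek mu (U+03BC) are accepted.
_UNITS = {"ns": 1e-6, "\u00b5s": 1e-3, "\u03bcs": 1e-3, "us": 1e-3,
          "ms": 1.0, "s": 1e3}
_REPLY = re.compile(
    r"Reply from (\S+) in ([0-9.]+)(" + "|".join(_UNITS) + r")\b")


def ping_latencies_ms(output: str) -> List[float]:
    """Extract round-trip times, in milliseconds, from receptorctl ping output."""
    return [float(value) * _UNITS[unit]
            for _node, value, unit in _REPLY.findall(output)]
```

Feeding it the captured output of a ping run gives a list that can be passed to `max()` or `statistics.mean()`. In the sample above, the node reached over the wider network answers in roughly 4 ms, while the local peer answers in under 1 ms.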
- Review the Receptor configuration. The configuration is generated by the installer and should not normally change afterwards. If you suspect that it has changed, check that the configuration file, /etc/receptor/receptor.conf, is correct on both the controllers and the execution node(s). The following file was generated during the installation:
[root@aap01 receptor]# cat receptor.conf
---
- node:
    id: aap01.example.net
    firewallrules:
      - action: "reject"
        tonode: "aap01.example.net"
        toservice: "control"
- work-signing:
    privatekey: /etc/receptor/work_private_key.pem
    tokenexpiration: 1m
- work-verification:
    publickey: /etc/receptor/work_public_key.pem
# Log Level
- log-level: info
# Control Service
- control-service:
    service: control
    filename: /var/run/awx-receptor/receptor.sock
    permissions: 0660
    tls: tls_server
# TLS
- tls-server:
    name: tls_server
    cert: /etc/receptor/tls/aap01.example.net.crt
    key: /etc/receptor/tls/aap01.example.net.key
    clientcas: /etc/receptor/tls/ca/mesh-CA.crt
    requireclientcert: true
- tls-client:
    name: tls_client
    cert: /etc/receptor/tls/aap01.example.net.crt
    key: /etc/receptor/tls/aap01.example.net.key
    rootcas: /etc/receptor/tls/ca/mesh-CA.crt
    insecureskipverify: false
# Peers
- tcp-peer:
    address: aap02.example.net:27199
    redial: true
    tls: tls_client
- tcp-peer:
    address: aap03.example.net:27199
    redial: true
    tls: tls_client
- tcp-peer:
    address: aap01gcp.example.net:27199
    redial: true
    tls: tls_client
# Work-commands
- work-command:
    worktype: local
    command: /var/lib/awx/venv/awx/bin/ansible-runner
    params: worker
    allowruntimeparams: true
    verifysignature: true
- work-kubernetes:
    worktype: kubernetes-runtime-auth
    authmethod: runtime
    allowruntimeauth: true
    allowruntimepod: true
    allowruntimeparams: true
    verifysignature: true
- work-kubernetes:
    worktype: kubernetes-incluster-auth
    authmethod: incluster
    allowruntimeauth: true
    allowruntimepod: true
    allowruntimeparams: true
    verifysignature: true
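To check quickly which peers a node is configured to dial, the peer addresses can be pulled out of the configuration text. The sketch below is a plain text scan, not a YAML parser. It assumes that every `address:` line has the `host:port` form used by the tcp-peer entries above, and the function name is hypothetical.

```python
import re
from typing import List, Tuple

# Matches lines such as "address: aap02.example.net:27199" at any indentation.
_ADDRESS = re.compile(r"^\s*address:\s*([^\s:#]+):(\d+)\s*$", re.MULTILINE)


def peer_addresses(conf_text: str) -> List[Tuple[str, int]]:
    """Return (host, port) pairs for every 'address: host:port' line."""
    return [(host, int(port)) for host, port in _ADDRESS.findall(conf_text)]
```

The resulting pairs can be compared with the connections reported by receptorctl status, or used as targets for a TCP reachability test.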
- Restart Receptor. Sometimes simply restarting the Receptor can help resolve minor issues:
systemctl restart receptor
- Ensure there aren’t any firewall rules or networking issues preventing communication. Check firewall settings on both the Controller and execution nodes to ensure the required ports for Receptor are open. The receptor is using port 27199. You can use the receptor status or ping commands to verify the communication. If the receptor ping does not work, that might indicate routing or firewall issues. Things to check:
- firewall-cmd --list-all (if firewalld is used on the host)
- firewall configuration at the network level
- routing: use the system-level traceroute command or the Receptor-level traceroute.
[root@aap01 receptor]# receptorctl --socket /var/run/awx-receptor/receptor.sock traceroute aap01gcp.example.net
0: aap01.example.net in 249.524µs
1: aap01gcp.example.net in 4.00961ms
Receptor TLS/SSL Issues
When verifying TLS/SSL configurations, especially in the context of Receptor communications in Ansible Automation Platform, you can follow the steps below to ensure everything is in order:
- Check Certificate Expiry. You can use the openssl tool to inspect a certificate’s details, including its expiration date. Check if ‘Not Before’ and ‘Not After’ are correct and the certificate is still valid:
[root@aap01 receptor]# openssl x509 -in /etc/receptor/tls/aap01.example.net.crt -noout -text
Certificate:
Data:
Version: 3 (0x2)
Serial Number: 1676526891 (0x63edc52b)
Signature Algorithm: sha256WithRSAEncryption
Issuer: CN = Ansible Automation Controller Nodes Mesh ROOT CA
Validity
Not Before: Feb 16 05:54:51 2023 GMT
Not After : Feb 6 05:54:32 2033 GMT
Subject: CN = aap01.example.net
Subject Public Key Info:
Public Key Algorithm: rsaEncryption
RSA Public-Key: (4096 bit)
Modulus:
87:82:34:3d:3d:3b:7a:c7:bd:7f:0d:4f:b6:cf:ea:
26:36:01:94:b5:87:02:b4:4c:00:98:ba:6b:4c:6f:
7f:2a:4b:f7:6f:b9:50:af:43:80:ea:f7:4b:b5:68:
e2:75:de:93:e0:df:dd:90:72:5e:45:8d:5a:4e:35:
b7:12:3c:2f:f2:c4:22:1f:87:d8:ca:6f:ae:84:1e:
2e:f8:01:4c:a2:22:fd:fd:4c:2b:ea:31:b8:a7:5b:
d0:8d:08:4f:a7:58:25:b3:6d:15:11:67:b7:b1:51:
da:39:ed:61:3a:77:15:9a:cd:e2:4e:4c:ee:97:17:
31:cf:13:df:e8:5a:ee:8e:35:3e:3c:60:dc:7e:10:
c2:23:2f:37:c8:72:75:aa:79:26:c1:c0:83:76:33:
a2:a8:63:de:e8:cd:07:46:3d:66:3b:3e:63:71:ed:
a9:d9:7e:ba:79:db:ab:dd:66:a0:6f:27:88:79:7a:
51:cc:fe:76:1e:94:d4:ac:dc:8c:d6:70:56:67:cc:
47:4c:ba:58:e3:e9:50:c3:69:73:b6:a0:5e:e0:1a:
ef:6e:91:15:08:41:b5:9c:d4:e5:2b:97:cf:db:22:
53:48:fa:50:28:a8:6e:17:3f:dd:0b:4e:b1:0e:6a:
dc:28:6d:ec:eb:5f:16:f0:eb:33:ac:d2:f9:60:2a:
ba:02:44:89:b5:80:3e:d9:0f:21:08:cd:3e:e2:f4:
4d:04:11:8f:f6:d2:af:23:ed:9f:5c:a2:87:2a:52:
81:c0:f0:81:64:7f:47:13:2c:18:40:9b:88:25:47:
3a:d4:a8:5c:43:26:27:7f:7f:1f:40:4f:7f:1d:38:
00:fa:de:47:c6:16:58:a5:54:a7:86:cc:e3:df:43:
72:40:d2:09:4b:47:77:05:4b:9f:23:d9:62:ce:70:
35:0c:05:09:1a:79:d2:9b:0d:6f:d4:6e:db:97:89:
1a:0b:fb:ed:ae:c8:2d:fb:7c:8d:b3:47:38:78:36:
5a:0b:b5:37:9d:f8:de:d0:81:6f:76:bf:75:30:40:
b1:6c:71
Exponent: 65537 (0x10001)
X509v3 extensions:
X509v3 Key Usage: critical
Digital Signature
X509v3 Extended Key Usage:
TLS Web Client Authentication, TLS Web Server Authentication
X509v3 Authority Key Identifier:
keyid:B3:59:0F:79:B2:41:78:5C:7D:31:9F:95:DD:98:0F:6E:B6:7B:C8:FC
X509v3 Subject Alternative Name:
DNS:aap01.example.net, IP Address:10.29.32.222, othername:<unsupported>
Signature Algorithm: sha256WithRSAEncryption
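Reading the validity dates by eye works for one certificate, but a small helper makes it easier to check several nodes. The sketch below parses the "Not After" line from the text output shown above and reports the number of whole days left. It assumes the English month abbreviations and the GMT suffix that openssl prints by default; the function names are illustrative.

```python
import re
from datetime import datetime, timezone


def not_after(cert_text: str) -> datetime:
    """Parse the 'Not After' timestamp from 'openssl x509 -text' output."""
    match = re.search(r"Not After\s*:\s*(.+?)\s*$", cert_text, re.MULTILINE)
    if match is None:
        raise ValueError("no 'Not After' field found")
    # openssl prints e.g. 'Feb  6 05:54:32 2033 GMT'; the value is in UTC.
    parsed = datetime.strptime(match.group(1), "%b %d %H:%M:%S %Y GMT")
    return parsed.replace(tzinfo=timezone.utc)


def days_until_expiry(cert_text: str, now: datetime) -> int:
    """Whole days between now and the expiry date (negative once expired)."""
    return (not_after(cert_text) - now).days
```

Calling `days_until_expiry(text, datetime.now(timezone.utc))` on the captured output gives a number that is easy to alert on, for example when it drops below a threshold of your choosing.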
- Ensure private key matches the certificate. You can check that the private key corresponds to the certificate:
openssl x509 -noout -modulus -in /path/to/certificate.crt | openssl md5
openssl rsa -noout -modulus -in /path/to/private.key | openssl md5
The MD5 hashes from both commands should match.
[root@aap01 receptor]# openssl x509 -noout -modulus -in /etc/receptor/tls/aap01.example.net.crt | openssl md5
(stdin)= 8baa6271452a553a492cb79f70586100
[root@aap01 receptor]# openssl rsa -noout -modulus -in /etc/receptor/tls/aap01.example.net.key | openssl md5
(stdin)= 8baa6271452a553a492cb79f70586100
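If the comparison is scripted, the two digest lines can be compared programmatically rather than visually. This sketch assumes the digest is whatever follows the last "=" on the line, which covers the "(stdin)= ..." form shown above and, to the best of my knowledge, the "MD5(stdin)= ..." form printed by newer OpenSSL releases. The function names are hypothetical.

```python
def digest_value(openssl_line: str) -> str:
    """Return the hex digest from a line such as '(stdin)= 8baa...'."""
    # Keep only what follows the last '=' so that a prefix such as
    # '(stdin)' or 'MD5(stdin)' does not affect the comparison.
    return openssl_line.rsplit("=", 1)[-1].strip().lower()


def key_matches_cert(cert_line: str, key_line: str) -> bool:
    """True when both digest lines carry the same non-empty hash."""
    cert_digest = digest_value(cert_line)
    return bool(cert_digest) and cert_digest == digest_value(key_line)
```

The empty-digest guard matters because a failed openssl command prints nothing on standard output, and two empty strings would otherwise compare as equal.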
- Test TLS Connection. You can use openssl to manually establish a connection to the Receptor service and check the TLS handshake:
openssl s_client -connect receptor-node-ip:port -CAfile /path/to/ca.crt
[root@aap01 receptor]# openssl s_client -connect aap02.example.net:27199 -CAfile /etc/receptor/tls/ca/mesh-CA.crt
CONNECTED(00000003)
depth=1 CN = Ansible Automation Controller Nodes Mesh ROOT CA
verify return:1
depth=0 CN = aap02.example.net
verify return:1
---
Certificate chain
0 s:CN = aap02.example.net
i:CN = Ansible Automation Controller Nodes Mesh ROOT CA
---
Server certificate
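The same handshake test can be sketched with Python's standard ssl module, which is convenient when the check has to run from a script. The example assumes the CA, certificate, and key paths from the configuration file shown earlier in this guide; the function names are illustrative. Because the sample configuration sets requireclientcert to true, a client certificate is loaded when one is supplied.

```python
import socket
import ssl
from typing import Optional


def mesh_tls_context(ca_file: Optional[str] = None,
                     cert_file: Optional[str] = None,
                     key_file: Optional[str] = None) -> ssl.SSLContext:
    """Build a client context that verifies the peer against the given CA."""
    # With ca_file=None the system trust store is used instead.
    context = ssl.create_default_context(cafile=ca_file)
    if cert_file and key_file:
        # Present a client certificate, which the peer may require.
        context.load_cert_chain(certfile=cert_file, keyfile=key_file)
    return context


def peer_certificate(host: str, port: int, context: ssl.SSLContext,
                     timeout: float = 5.0) -> dict:
    """Complete a TLS handshake and return the peer certificate as a dict."""
    with socket.create_connection((host, port), timeout=timeout) as raw:
        with context.wrap_socket(raw, server_hostname=host) as tls:
            return tls.getpeercert()
```

A call such as `peer_certificate("aap02.example.net", 27199, mesh_tls_context("/etc/receptor/tls/ca/mesh-CA.crt", "/etc/receptor/tls/aap01.example.net.crt", "/etc/receptor/tls/aap01.example.net.key"))` raises `ssl.SSLError` (or its subclass `ssl.SSLCertVerificationError`) when the chain or host name does not verify, which narrows the problem down to the certificates rather than the network.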
Troubleshooting Execution Environments
Troubleshooting execution environments (EE) on Ansible Controller involves several steps, as execution environments play a vital role in encapsulating resources needed to run playbooks. Here’s a structured approach:
- Verify EE Image:
- Ensure that the EE image exists and is correctly specified. Use podman to list images:
podman images
- Run EE Image Manually:
- Try to run the image manually to see if there are any issues:
podman run -it --rm <image_name_or_id> /bin/bash
This will allow you to enter the EE container. You can inspect its content and check if all required tools, libraries, and Ansible collections or roles are present.
- Verify Resources:
- Ensure that the host running the Ansible Controller has sufficient resources (CPU, memory, disk space). Running out of resources can cause unexpected issues with execution environments. For example, verify that there is free space on the /var/lib/awx file system. Note that the controller keeps its container images under /var/lib/awx/.local:
df -h
- Ansible Configuration:
- Check the Ansible configuration inside the EE. The ansible.cfg file should have the right parameters. If you’re inside the EE container, you can display it using:
cat /etc/ansible/ansible.cfg
- EE Specific Errors:
- If your playbooks refer to custom modules, roles, or collections, ensure they are available within the EE. Remember, EEs should encapsulate all the required dependencies to run the playbook.
- Network Access:
- Ensure the EE has appropriate network access to reach target nodes, any required repositories, or other resources.
- Security Contexts and Privileges:
- If the Controller runs on a system with SELinux, ensure that your EE is granted the necessary contexts or privileges to operate.
- Dependencies and Pipelines:
- If your EE has dependencies on external systems or services, ensure they are operational. This could include SCM repositories, credential stores, or third-party services.
- Validate Playbooks:
- It might be an issue with the playbook rather than the EE. Try running the playbook with verbosity:
ansible-playbook -vvv your_playbook.yml
- This might give more insight into where and why the playbook is failing.
- Custom EE Builds:
- If you’re building your own EE images, ensure that the build process completes without errors. Check the Containerfile for any issues.
- Controller Configuration:
- Ensure that Ansible Controller itself is correctly configured to use EEs. This includes verifying paths, image names, or any other settings specific to EEs.
By following these steps, you can systematically identify and resolve issues with execution environments on the Ansible Controller. If you’re still encountering problems, consult the official Ansible documentation or reach out to Red Hat support.
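The resource check from the list above can be scripted so that it runs before large image pulls. This is a minimal sketch using the standard library; the function names are illustrative, and any threshold you pass in is your own choice rather than a documented requirement.

```python
import shutil


def free_gib(path: str) -> float:
    """Return the free space, in GiB, on the file system that holds path."""
    return shutil.disk_usage(path).free / 2**30


def has_headroom(path: str, min_free_gib: float) -> bool:
    """True when at least min_free_gib GiB are free under path."""
    return free_gib(path) >= min_free_gib
```

For example, `has_headroom("/var/lib/awx", 10)` checks the file system that this guide names as the location of the controller's container images, using a hypothetical 10 GiB threshold.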
Troubleshooting Communication between AAP Controller and Hub
- Basic Connectivity:
- Check if the Controller node can reach Automation Hub:
ping <automation_hub_host>
- Check Ports:
- Automation Hub typically listens on port 443 for SSL traffic. Ensure this port is open and reachable:
curl -vk https://<automation_hub_host>/
- View Logs on Controller Node:
- For the Controller services:
journalctl -u automation-controller
- Also, refer to the logs located in /var/log/tower/.
- View Logs on Automation Hub:
- For the Pulp services, which back Automation Hub:
journalctl -u pulpcore-worker@*.service
- Check the Pulp logs typically located in /var/log/pulp/.
- Verify SSL/TLS:
- Ensure certificates are correctly configured, valid, and trusted on both ends.
- If self-signed certificates are in use, ensure they’re added to the trust store on the Controller nodes.
- API Authentication:
- The controller communicates with Automation Hub using token-based authentication. Ensure tokens are valid and not expired.
- Proxy Issues:
- If a proxy is in use, ensure it is correctly configured. Check proxy logs for denied requests.
- Firewall Rules:
- Check if there are firewall rules blocking communication between the Controller and Automation Hub.
- DNS Issues:
- Ensure DNS resolution works correctly from Controller to Automation Hub:
nslookup <automation_hub_host>
- Network Configuration:
- Check for changes in network configurations that might have affected communication.
- Automation Hub Health:
- Check that Automation Hub’s services and processes are running and healthy.
- Database Connectivity for Automation Hub:
- Ensure the database backend for Automation Hub (Pulp) can connect without issues.
- Controller Version Compatibility:
- Ensure Controller and Automation Hub versions are compatible.
- Updates/Patches:
- Check for any updates or patches that might address communication issues.
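The DNS check from the list above can also be done from Python, which helps when the same test has to run on several controller nodes. The sketch below returns the addresses a name resolves to, or an empty list when resolution fails; the function name is hypothetical.

```python
import socket
from typing import List


def resolve_host(host: str) -> List[str]:
    """Return the sorted, de-duplicated IP addresses that host resolves to."""
    try:
        infos = socket.getaddrinfo(host, None, proto=socket.IPPROTO_TCP)
    except socket.gaierror:
        return []  # The name did not resolve.
    # Each entry is (family, type, proto, canonname, sockaddr);
    # sockaddr[0] holds the address string.
    return sorted({info[4][0] for info in infos})
```

If the controller nodes return different address lists for the same Automation Hub host name, the cause is likely to be inconsistent resolver configuration or stale /etc/hosts entries.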
SAML authentication issues between Ansible Controller and Microsoft Azure AD
- Configuration Check:
- Ensure that the configuration on both Ansible Controller and Azure AD side matches. This includes Entity IDs, Assertion Consumer Service URLs, and other key SAML attributes.
- Certificate Validation:
- Ensure the certificate used for signing the SAML assertion in Azure AD is still valid and has not expired. Note that on the AAP side, the certificate and key in use are tower.cert and tower.key.
- The same certificate should be configured in Ansible Controller’s SAML settings.
- If you’re using encrypted assertions, make sure you’ve provided the correct
- Assertion Content:
- Using tools like SAML Tracer for Firefox or the SAML Chrome Panel for Chrome can help you capture the SAML assertion sent from Azure AD to Ansible Controller.
- Check if the attributes in the assertion match what’s expected by Ansible Controller.
- Azure AD Configuration:
- Ensure that the user trying to authenticate is part of the Azure AD user group that’s allowed to log into Ansible Controller.
- Confirm that the SAML configuration on the Azure AD side, especially the claim rules, is correctly set up to provide the necessary claims to Ansible Controller.
- Endpoint URLs:
- Ensure that the SSO URL and Entity ID on both Azure AD and Ansible Controller match. Any discrepancies here will cause authentication to fail.
- Clock Skew:
- SAML assertions are time sensitive. Ensure that the system clocks on both Ansible Controller and Azure AD (Azure’s end will typically be accurate) are synchronised.
- Role and Attribute Mapping:
- In Ansible Controller’s SAML settings, ensure that you’ve correctly mapped Azure AD attributes to Controller’s user attributes and roles.
- Network Issues:
- Confirm there are no network issues preventing Ansible Controller from reaching Azure AD or vice versa.
- Session and Cookie Issues:
- Clear cookies and session data in your browser or try a different browser to rule out any session-specific issues.
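To make the clock skew point concrete, the sketch below checks whether a given time falls inside an assertion's validity window. `NotBefore` and `NotOnOrAfter` are the standard attributes of the SAML Conditions element, and their values can be copied from an assertion captured with a SAML tracing tool. The function names and the optional skew tolerance are illustrative, and the parser assumes UTC timestamps ending in "Z" with at most microsecond precision.

```python
from datetime import datetime, timedelta


def parse_saml_time(value: str) -> datetime:
    """Parse a UTC SAML timestamp such as '2023-09-13T14:26:18Z'."""
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def assertion_time_valid(not_before: str, not_on_or_after: str,
                         now: datetime, skew_seconds: int = 0) -> bool:
    """True when now lies in [NotBefore, NotOnOrAfter), widened by the skew."""
    skew = timedelta(seconds=skew_seconds)
    return (parse_saml_time(not_before) - skew
            <= now
            < parse_saml_time(not_on_or_after) + skew)
```

If the check fails for the controller's current time (a timezone-aware value such as `datetime.now(timezone.utc)`) but passes once a tolerance of a minute or two is allowed, the controller's clock is the first thing to inspect.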
AWX-MANAGE Tool
- Configuration:
- View current AAP configuration:
awx-manage print_settings
- Check for pending migrations. This command can be used after an upgrade to verify the progress of the database migrations:
awx-manage showmigrations
- Database:
- Check the database (whether it responds, its version, and so on):
awx-manage check_db
- Clean up old job history (this will remove old jobs and preserve space):
awx-manage cleanup_jobs
- Verify the instances (controller nodes and execution nodes):
awx-manage list_instances
AWX Tool
Remember to add the following after each command:
--conf.host https://aap_fqdn --conf.username admin --conf.password 'password' --conf.insecure
- Get Configuration:
awx config
- List All Jobs:
awx jobs list
- Retrieve Details of a Specific Job:
awx jobs get <job_id>
- List All Projects:
awx projects list
- Retrieve Details of a Specific Project:
awx projects get <project_id>
- List All Inventories:
awx inventories list
- Retrieve Details of a Specific Inventory:
awx inventories get <inventory_id>
- List All Hosts within an Inventory:
awx hosts list --inventory <inventory_id>
- Ad-hoc Commands:
awx ad_hoc_commands create --inventory <inventory_id> --module-name <module_name> --module-args "<module_arguments>"
- List All Job Templates:
awx job_templates list
- Launch a Job Template:
awx job_templates launch <template_id> --monitor
- List All Users:
awx users list
- Retrieve Details of a Specific User:
awx users get <user_id>
- Create a New User:
awx users create --username <username> --password <password> --email <email>
- List All Organisations:
awx organizations list
- Ping the AWX API:
awx ping
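Because the same connection flags follow every command, a small wrapper can build the full argument list once. The sketch below only assembles the arguments from the flags listed at the start of this section; the function name is hypothetical, and nothing is executed until you pass the list to a process runner yourself.

```python
from typing import List, Sequence


def awx_argv(subcommand: Sequence[str], host: str, username: str,
             password: str, insecure: bool = False) -> List[str]:
    """Build an awx CLI argument list with the connection flags appended."""
    argv = ["awx", *subcommand,
            "--conf.host", host,
            "--conf.username", username,
            "--conf.password", password]
    if insecure:
        # Skips TLS certificate verification; suitable for lab setups only.
        argv.append("--conf.insecure")
    return argv
```

For instance, `subprocess.run(awx_argv(["jobs", "list"], "https://aap_fqdn", "admin", "password"), check=True)` runs the job listing. Passing a list instead of a shell string avoids quoting problems with special characters in the password, although the password remains visible in the process list.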
Ansible Private Automation Hub
- Check Pulp Worker Status: Use systemctl to check the status of the Pulp workers.
systemctl status pulpcore-worker@*.service
- Review Pulp Worker Logs: The logs can give insight into any issues the workers might be facing.
journalctl -u pulpcore-worker@*.service
- Restart Pulp Workers: If a worker seems to be stuck or malfunctioning, you can restart it.
systemctl restart pulpcore-worker@*.service
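- Reconcile Artifact Checksums: If tasks fail because of checksum errors, for example after the allowed checksum types were changed, the following command brings the stored artifact checksums in line with the current settings. Note that it does not remove old tasks.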
pulpcore-manager handle-artifact-checksums
- Database Issues: Check the status of the PostgreSQL database that Pulp uses. If there are connectivity issues, Pulp tasks will fail.
- Disk Space: Ensure that there’s enough disk space where Pulp stores its content and artefacts. Running out of space can cause tasks to fail.
- Check the Number of Workers: If you’re dealing with a large number of tasks or large-sized content, you might need to scale up the number of Pulp workers.
- SELinux: SELinux policy denials can interfere with the operation of services, including Pulp. Check for any relevant AVC denials.
ausearch -m AVC -ts recent
- Check Connectivity to External Repositories: If you're having issues syncing from external repositories, ensure that network connectivity is in place and that firewalls or proxies aren't blocking the connection.