Itzik Gur - 05.10.2023


Ansible Automation Platform Troubleshooting

To troubleshoot communication between the controller nodes and execution nodes in Ansible Automation Platform, you can follow these steps:

Check the network connectivity and firewall settings between the nodes. You can use tools like ping, traceroute, telnet, nc, etc. to test the network reachability and latency. You can also use the ansible -m ping command to test the Ansible connectivity between the nodes.
Check the receptor configuration and status on each node. Use the receptorctl command to view and manage the receptor mesh network; receptorctl status lists the nodes, connections, and work types in the mesh.
Check the logs and metrics of the receptor service on each node. You can use tools like journalctl, tail, grep, etc. to view and filter the logs.
Check the Ansible Automation Platform web UI and API for any errors or warnings related to the node registration, grouping, or health.

Verify the Receptor

Receptor is an overlay networking layer that provides a mechanism for Ansible Automation Platform to communicate with execution nodes, the nodes that run automation jobs against managed hosts. In newer versions of Ansible Automation Platform, Receptor serves as the underlying communication backbone.

Here are some steps and commands you can use to troubleshoot Receptor communication between an Ansible Controller and execution nodes:

Check Receptor Status. On the Ansible Controller, you can check the status of the Receptor using:

receptorctl --socket /var/run/awx-receptor/receptor.sock status

Example output:

Node ID: aap01.example.net
Version: 1.4.1
System CPU Count: 2
System Memory MiB: 7761

Connection                          Cost
aap02.example.net                   1
aap03.example.net                   1
aap01gcp.example.net                1

Known Node                          Known Connections
aap01.example.net                   aap02.example.net: 1 aap03.example.net: 1 aap01gcp.example.net: 1
aap02.example.net                   aap01.example.net: 1 aap03.example.net: 1 aap01gcp.example.net: 1
aap03.example.net                   aap01.example.net: 1 aap02.example.net: 1 aap01gcp.example.net: 1
aap01gcp.example.net                aap01.example.net: 1 aap02.example.net: 1 aap03.example.net: 1

Route                               Via
aap02.example.net                   aap02.example.net
aap03.example.net                   aap03.example.net
aap01gcp.example.net                aap01gcp.example.net

Node                                Service   Type       Last Seen             Tags
aap01.example.net                   control   StreamTLS  2023-09-13 14:26:18   {'type': 'Control Service'}
aap01gcp.example.net                control   StreamTLS  2023-09-13 14:25:54   {'type': 'Control Service'}
aap03.example.net                   control   StreamTLS  2023-09-13 14:25:24   {'type': 'Control Service'}
aap02.example.net                   control   StreamTLS  2023-09-13 14:25:54   {'type': 'Control Service'}

Node                                Secure Work Types
aap01.example.net                   local, kubernetes-runtime-auth, kubernetes-incluster-auth
aap01gcp.example.net                ansible-runner
aap03.example.net                   local, kubernetes-runtime-auth, kubernetes-incluster-auth
aap02.example.net                   local, kubernetes-runtime-auth, kubernetes-incluster-auth
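
To confirm that every expected node is routable without reading the table by eye, the Route section of the status output can be filtered. This is a minimal sketch that assumes the column layout shown in the example output above; routes_from_status and check_routes are hypothetical helper names, not part of receptorctl.

```shell
# Print "node via" pairs from the Route section of `receptorctl status`
# output read on stdin. Assumes the layout shown in the example above.
routes_from_status() {
    awk '
        $1 == "Route" && $2 == "Via" { in_routes = 1; next }  # section header
        in_routes && $1 == "Node"    { in_routes = 0 }        # next section starts
        in_routes && NF >= 2         { print $1, $2 }
    '
}

# Print a message for each expected node that has no route.
# Usage: check_routes STATUS_FILE NODE...
# Returns 1 if at least one node is missing, 0 otherwise.
check_routes() {
    status_file=$1
    shift
    rc=0
    for node in "$@"; do
        if ! routes_from_status < "$status_file" |
            awk -v n="$node" '$1 == n { found = 1 } END { exit !found }'; then
            echo "no route to $node"
            rc=1
        fi
    done
    return $rc
}

# Example (run on a controller node):
#   receptorctl --socket /var/run/awx-receptor/receptor.sock status > /tmp/receptor-status.txt
#   check_routes /tmp/receptor-status.txt aap02.example.net aap03.example.net aap01gcp.example.net
```

A node that appears under Known Node but not under Route is known to the mesh but currently unreachable from this node, which is usually the first thing worth investigating.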

Review Receptor Logs. Logs can be found in the system’s journal. You can view them using:

journalctl -u receptor
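
On a busy node the journal can be long, so a quick count of problem lines helps decide where to look first. The sketch below assumes that receptor log lines carry an uppercase level keyword such as ERROR or WARNING; that is an assumption about the log format, and log_level_summary is a hypothetical helper name.

```shell
# Count log lines that mention ERROR or WARNING in text read on stdin.
# Assumes the level appears as an uppercase keyword on each line.
log_level_summary() {
    awk '
        /ERROR/   { errors++ }
        /WARNING/ { warnings++ }
        END { printf "errors=%d warnings=%d\n", errors, warnings }
    '
}

# Example:
#   journalctl -u receptor --since "1 hour ago" --no-pager | log_level_summary
```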

Verify mesh level communication. From the Ansible Controller, you can try pinging the execution node using the receptor command to see if it’s reachable:

[root@aap01 receptor]# receptorctl --socket /var/run/awx-receptor/receptor.sock ping aap02.example.net  
 Reply from aap02.example.net in 811.135µs  
 Reply from aap02.example.net in 814.871µs  
 Reply from aap02.example.net in 852.096µs  
 Reply from aap02.example.net in 848.816µs  
 [root@aap01 receptor]# receptorctl --socket /var/run/awx-receptor/receptor.sock ping aap01gcp.example.net  
 Reply from aap01gcp.example.net in 4.143774ms  
 Reply from aap01gcp.example.net in 4.049415ms  
 Reply from aap01gcp.example.net in 7.643543ms  
 Reply from aap01gcp.example.net in 4.131193ms

Repeat the test from the execution node to the controller nodes:

[root@aap01gcp ~]# receptorctl --socket /var/run/awx-receptor/receptor.sock ping aap01.example.net  
 Reply from aap01.example.net in 3.998636ms  
 Reply from aap01.example.net in 4.07025ms  
 Reply from aap01.example.net in 4.053869ms  
 Reply from aap01.example.net in 4.43546ms  
  
 
 [root@aap01gcp ~]# receptorctl --socket /var/run/awx-receptor/receptor.sock ping aap02.example.net  
 Reply from aap02.example.net in 3.131143ms  
 Reply from aap02.example.net in 3.118211ms  
 Reply from aap02.example.net in 3.35466ms  
 Reply from aap02.example.net in 3.120776ms
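
The replies mix units (µs for nearby nodes, ms for the remote one), which makes them awkward to compare across nodes. As a rough sketch, the helper below averages the reply times in milliseconds. It assumes the "Reply from NODE in DURATION" format shown above, with durations in ns, µs, ms, or s; ping_avg_ms is a hypothetical name.

```shell
# Print the average reply time in milliseconds for `receptorctl ping`
# output read on stdin. Exits with status 1 if no reply lines are found.
ping_avg_ms() {
    awk '
        /^ *Reply from/ {
            d = $NF
            if (d ~ /ms$/)             { sub(/ms$/, "", d); ms = d + 0 }
            else if (d ~ /ns$/)        { sub(/ns$/, "", d); ms = d / 1000000 }
            else if (d ~ /^[0-9.]+s$/) { sub(/s$/, "", d);  ms = d * 1000 }
            else                       { sub(/[^0-9.]+$/, "", d); ms = d / 1000 }  # microseconds
            total += ms
            n++
        }
        END {
            if (n == 0) exit 1
            printf "%.3f\n", total / n
        }
    '
}

# Example:
#   receptorctl --socket /var/run/awx-receptor/receptor.sock ping aap01gcp.example.net | ping_avg_ms
```

A non-zero exit status means no replies were seen at all, which points to a routing or firewall problem rather than latency.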

Receptor's configuration should not change after installation. If you suspect that it has, make sure the configuration files on both the Controllers and the execution node(s) are correct. The configuration is stored in /etc/receptor/receptor.conf; review this file for any misconfigurations. The following file was generated during the installation:

[root@aap01 receptor]# cat receptor.conf  
 ---  
 - node:  
     id: aap01.example.net  
     firewallrules:  
       - action: "reject"  
         tonode: "aap01.example.net"  
         toservice: "control"  
  
 
 - work-signing:  
     privatekey: /etc/receptor/work_private_key.pem  
     tokenexpiration: 1m  
  
 
 - work-verification:  
     publickey: /etc/receptor/work_public_key.pem  
  
 
  
 
 # Log Level  
 - log-level: info  
  
 
 # Control Service  
 - control-service:  
     service: control  
     filename: /var/run/awx-receptor/receptor.sock  
     permissions: 0660  
     tls: tls_server  
  
 
 # TLS  
 - tls-server:  
     name: tls_server  
     cert: /etc/receptor/tls/aap01.example.net.crt  
     key: /etc/receptor/tls/aap01.example.net.key  
     clientcas: /etc/receptor/tls/ca/mesh-CA.crt  
     requireclientcert: true  
  
 
 - tls-client:  
     name: tls_client  
     cert: /etc/receptor/tls/aap01.example.net.crt  
     key: /etc/receptor/tls/aap01.example.net.key  
     rootcas: /etc/receptor/tls/ca/mesh-CA.crt  
     insecureskipverify: false  
  
 
  
 
 # Peers  
 - tcp-peer:  
     address: aap02.example.net:27199  
     redial: true  
     tls: tls_client  
 - tcp-peer:  
     address: aap03.example.net:27199  
     redial: true  
     tls: tls_client  
 - tcp-peer:  
     address: aap01gcp.example.net:27199  
     redial: true  
     tls: tls_client  
  
 
 # Work-commands  
 - work-command:  
     worktype: local  
     command: /var/lib/awx/venv/awx/bin/ansible-runner  
     params: worker  
     allowruntimeparams: true  
     verifysignature: true  
  
 
 - work-kubernetes:  
     worktype: kubernetes-runtime-auth  
     authmethod: runtime  
     allowruntimeauth: true  
     allowruntimepod: true  
     allowruntimeparams: true  
     verifysignature: true  
  
 
 - work-kubernetes:  
     worktype: kubernetes-incluster-auth  
     authmethod: incluster  
     allowruntimeauth: true  
     allowruntimepod: true  
     allowruntimeparams: true  
     verifysignature: true
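
When reviewing the file, it can help to list the configured peers so that each one can be checked in turn. The following is a sketch that assumes the simple YAML layout shown above (one address: line under each tcp-peer entry); peers_from_conf is a hypothetical helper, and the text matching is not a full YAML parser.

```shell
# Print the address of every tcp-peer entry in a receptor.conf read on stdin.
# Relies on the layout shown above: "- tcp-peer:" followed by "address: HOST:PORT".
peers_from_conf() {
    awk '
        /^ *- /                     { in_peer = ($2 == "tcp-peer:") }  # new list item
        in_peer && $1 == "address:" { print $2 }
    '
}

# Example: ping every configured peer over the mesh.
#   peers_from_conf < /etc/receptor/receptor.conf | while read -r peer; do
#       receptorctl --socket /var/run/awx-receptor/receptor.sock ping "${peer%%:*}"
#   done
```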

Restart Receptor. Sometimes simply restarting the Receptor can help resolve minor issues:

  systemctl restart receptor

Ensure there aren’t any firewall rules or networking issues preventing communication. Check the firewall settings on both the Controller and execution nodes to ensure the ports required by Receptor are open; in this installation Receptor listens on TCP port 27199. Use the receptorctl status or ping commands to verify communication. A failing receptorctl ping may indicate routing or firewall issues. Things to check:
- firewall-cmd --list-all (if firewalld is used on the host)
- firewall configuration at the network level
- routing: use the system-level traceroute command or the receptor-level traceroute.

[root@aap01 receptor]# receptorctl --socket /var/run/awx-receptor/receptor.sock traceroute aap01gcp.example.net  
 0: aap01.example.net in 249.524µs  
 1: aap01gcp.example.net in 4.00961ms
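
For the firewalld check, a small filter can tell whether the Receptor port appears in the list of explicitly opened ports. This sketch assumes the space-separated output of firewall-cmd --list-ports (for example "443/tcp 27199/tcp"); port_is_listed is a hypothetical helper. A port can also be opened through a firewalld service or rich rule, so a negative result here is a hint rather than proof.

```shell
# Succeed if PORT/PROTO appears in `firewall-cmd --list-ports` output on stdin.
# Usage: firewall-cmd --list-ports | port_is_listed 27199/tcp
port_is_listed() {
    tr -s ' ' '\n' | grep -qxF "$1"
}

# Example:
#   if firewall-cmd --list-ports | port_is_listed 27199/tcp; then
#       echo "27199/tcp is open in the active zone"
#   else
#       echo "27199/tcp is not listed; check services and rich rules too"
#   fi
```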

Receptor TLS/SSL Issues

When verifying TLS/SSL configurations, especially in the context of Receptor communications in Ansible Automation Platform, you can follow the steps below to ensure everything is in order:

Check Certificate Expiry. You can use the openssl tool to inspect a certificate’s details, including its expiration date. Check if ‘Not Before’ and ‘Not After’ are correct and the certificate is still valid:

[root@aap01 receptor]# openssl x509 -in /etc/receptor/tls/aap01.example.net.crt -noout -text  
 Certificate:  
     Data:  
         Version: 3 (0x2)  
         Serial Number: 1676526891 (0x63edc52b)  
         Signature Algorithm: sha256WithRSAEncryption  
         Issuer: CN = Ansible Automation Controller Nodes Mesh ROOT CA  
         Validity  
             Not Before: Feb 16 05:54:51 2023 GMT  
             Not After : Feb  6 05:54:32 2033 GMT  
         Subject: CN = aap01.example.net  
         Subject Public Key Info:  
             Public Key Algorithm: rsaEncryption  
                 RSA Public-Key: (4096 bit)  
                 Modulus:  
                     87:82:34:3d:3d:3b:7a:c7:bd:7f:0d:4f:b6:cf:ea:  
                     26:36:01:94:b5:87:02:b4:4c:00:98:ba:6b:4c:6f:  
                     7f:2a:4b:f7:6f:b9:50:af:43:80:ea:f7:4b:b5:68:  
                     e2:75:de:93:e0:df:dd:90:72:5e:45:8d:5a:4e:35:  
                     b7:12:3c:2f:f2:c4:22:1f:87:d8:ca:6f:ae:84:1e:  
                     2e:f8:01:4c:a2:22:fd:fd:4c:2b:ea:31:b8:a7:5b:  
                     d0:8d:08:4f:a7:58:25:b3:6d:15:11:67:b7:b1:51:  
                     da:39:ed:61:3a:77:15:9a:cd:e2:4e:4c:ee:97:17:  
                     31:cf:13:df:e8:5a:ee:8e:35:3e:3c:60:dc:7e:10:  
                     c2:23:2f:37:c8:72:75:aa:79:26:c1:c0:83:76:33:  
                     a2:a8:63:de:e8:cd:07:46:3d:66:3b:3e:63:71:ed:  
                     a9:d9:7e:ba:79:db:ab:dd:66:a0:6f:27:88:79:7a:  
                     51:cc:fe:76:1e:94:d4:ac:dc:8c:d6:70:56:67:cc:  
                     47:4c:ba:58:e3:e9:50:c3:69:73:b6:a0:5e:e0:1a:  
                     ef:6e:91:15:08:41:b5:9c:d4:e5:2b:97:cf:db:22:  
                     53:48:fa:50:28:a8:6e:17:3f:dd:0b:4e:b1:0e:6a:  
                     dc:28:6d:ec:eb:5f:16:f0:eb:33:ac:d2:f9:60:2a:  
                     ba:02:44:89:b5:80:3e:d9:0f:21:08:cd:3e:e2:f4:  
                     4d:04:11:8f:f6:d2:af:23:ed:9f:5c:a2:87:2a:52:  
                     81:c0:f0:81:64:7f:47:13:2c:18:40:9b:88:25:47:  
                     3a:d4:a8:5c:43:26:27:7f:7f:1f:40:4f:7f:1d:38:  
                     00:fa:de:47:c6:16:58:a5:54:a7:86:cc:e3:df:43:  
                     72:40:d2:09:4b:47:77:05:4b:9f:23:d9:62:ce:70:  
                     35:0c:05:09:1a:79:d2:9b:0d:6f:d4:6e:db:97:89:  
                     1a:0b:fb:ed:ae:c8:2d:fb:7c:8d:b3:47:38:78:36:  
                     5a:0b:b5:37:9d:f8:de:d0:81:6f:76:bf:75:30:40:  
                     b1:6c:71  
                 Exponent: 65537 (0x10001)  
         X509v3 extensions:  
             X509v3 Key Usage: critical  
                 Digital Signature  
             X509v3 Extended Key Usage:  
                 TLS Web Client Authentication, TLS Web Server Authentication  
             X509v3 Authority Key Identifier:  
  
               keyid:B3:59:0F:79:B2:41:78:5C:7D:31:9F:95:DD:98:0F:6E:B6:7B:C8:FC  
  
 
             X509v3 Subject Alternative Name:  
                 DNS:aap01.example.net, IP Address:10.29.32.222, othername:<unsupported>  
     Signature Algorithm: sha256WithRSAEncryption
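
Reading the full -text output is useful once, but for routine checks it is often enough to know how many days remain. The sketch below takes the notAfter line printed by openssl x509 -noout -enddate and converts it to a day count. It assumes GNU date (as shipped with RHEL) for parsing the timestamp; days_until_expiry is a hypothetical helper name.

```shell
# Print the number of whole days until the given certificate end date.
# Accepts the string printed by `openssl x509 -noout -enddate`,
# e.g. "notAfter=Feb  6 05:54:32 2033 GMT". Negative means already expired.
# Requires GNU date for the -d option.
days_until_expiry() {
    end_date=${1#notAfter=}
    end_epoch=$(date -u -d "$end_date" +%s) || return 2
    now_epoch=$(date -u +%s)
    echo $(( (end_epoch - now_epoch) / 86400 ))
}

# Example:
#   days_until_expiry "$(openssl x509 -in /etc/receptor/tls/aap01.example.net.crt -noout -enddate)"
```

The same check is worth running against the mesh CA certificate, since an expired CA invalidates every node certificate it signed.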

Ensure private key matches the certificate. You can check that the private key corresponds to the certificate:

openssl x509 -noout -modulus -in /path/to/certificate.crt | openssl md5   
 openssl rsa -noout -modulus -in /path/to/private.key | openssl md5

The MD5 hashes from both commands should match.

[root@aap01 receptor]# openssl x509 -noout -modulus -in /etc/receptor/tls/aap01.example.net.crt | openssl md5  
 (stdin)= 8baa6271452a553a492cb79f70586100  
 [root@aap01 receptor]# openssl rsa -noout -modulus -in /etc/receptor/tls/aap01.example.net.key | openssl md5  
 (stdin)= 8baa6271452a553a492cb79f70586100

Test TLS Connection. You can use openssl to manually establish a connection to the Receptor service and check the TLS handshake. The first line below is the generic form, followed by an example:

openssl s_client -connect receptor-node-ip:port -CAfile /path/to/ca.crt  
 [root@aap01 receptor]# openssl s_client -connect aap02.example.net:27199 -CAfile /etc/receptor/tls/ca/mesh-CA.crt  
 CONNECTED(00000003)  
 depth=1 CN = Ansible Automation Controller Nodes Mesh ROOT CA  
 verify return:1  
 depth=0 CN = aap02.example.net  
 verify return:1  
 ---  
 Certificate chain  
  0 s:CN = aap02.example.net  
    i:CN = Ansible Automation Controller Nodes Mesh ROOT CA  
 ---  
 Server certificate

Troubleshooting Execution Environments

Troubleshooting execution environments (EE) on Ansible Controller involves several steps, as execution environments play a vital role in encapsulating resources needed to run playbooks. Here’s a structured approach:

Verify EE Image:
- Ensure that the EE image exists and is correctly specified. Use podman to list images:

podman images

Run EE Image Manually:
- Try to run the image manually to see if there are any issues:

podman run -it --rm <image_name_or_id> /bin/bash

This will allow you to enter the EE container. You can inspect its content and check if all required tools, libraries, and Ansible collections or roles are present.

Verify Resources:
- Ensure that the host running the Ansible Controller has sufficient resources (CPU, memory, disk space). Running out of resources can cause unexpected issues with execution environments. For example, verify that there is free space on the /var/lib/awx file system. Note that the controller keeps its container images under /var/lib/awx/.local.

df -h
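
To spot a nearly full file system quickly, the df output can be filtered by usage percentage. This is a sketch that reads the POSIX-format output of df -P; filesystems_over is a hypothetical helper, and the 80 percent threshold in the example is an arbitrary choice. Mount points that contain spaces are not handled.

```shell
# Print "MOUNTPOINT USE%" for every file system at or above LIMIT percent.
# Usage: df -P | filesystems_over LIMIT
filesystems_over() {
    awk -v limit="$1" '
        NR > 1 {                      # skip the header line
            use = $5
            sub(/%$/, "", use)
            if (use + 0 >= limit + 0) print $6, use "%"
        }
    '
}

# Example:
#   df -P | filesystems_over 80
```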

Ansible Configuration:
- Check the Ansible configuration inside the EE. The ansible.cfg file should have the right parameters. If you’re inside the EE container, you can display it using:

cat /etc/ansible/ansible.cfg

EE Specific Errors:
- If your playbooks refer to custom modules, roles, or collections, ensure they are available within the EE. Remember, EEs should encapsulate all the required dependencies to run the playbook.

Network Access:
- Ensure the EE has appropriate network access to reach target nodes, any required repositories, or other resources.

Security Contexts and Privileges:
- If the Controller runs on a system with SELinux, ensure that your EE is granted the necessary contexts or privileges to operate.

Dependencies and Pipelines:
- If your EE has dependencies on external systems or services, ensure they are operational. This could include SCM repositories, credential stores, or third-party services.

Validate Playbooks:
- It might be an issue with the playbook rather than the EE. Try running the playbook with verbosity:

ansible-playbook -vvv your_playbook.yml

This might give more insight into where and why the playbook is failing.

Custom EE Builds:
- If you’re building your own EE images, ensure that the build process completes without errors. Check the Containerfile for any issues.

Controller Configuration:
- Ensure that Ansible Controller itself is correctly configured to use EEs. This includes verifying paths, image names, or any other settings specific to EEs.

By following these steps, you can systematically identify and resolve issues with execution environments on the Ansible Controller. If you’re still encountering problems, consult the official Ansible documentation or reach out to Red Hat support.

Troubleshooting Communication between AAP Controller and Hub

Basic Connectivity:
- Check if the Controller node can reach Automation Hub:

ping <automation_hub_host>

Check Ports:
- Automation Hub typically listens on port 443 for HTTPS traffic. Ensure this port is open and reachable:

curl -vk https://<automation_hub_host>/

View Logs on Controller Node:
- For the Controller services:

journalctl -u automation-controller

Also, refer to the logs located in /var/log/tower/.

View Logs on Automation Hub:
- For the Pulp services, which back Automation Hub:

journalctl -u 'pulpcore-worker@*.service'

Check the Pulp logs typically located in /var/log/pulp/.

Verify SSL/TLS:
- Ensure certificates are correctly configured, valid, and trusted on both ends.
- If self-signed certificates are in use, ensure they’re added to the trust store on the Controller nodes.

API Authentication:
- The controller communicates with Automation Hub using token-based authentication. Ensure tokens are valid and not expired.

Proxy Issues:
- If a proxy is in use, ensure it’s correctly configured. Check proxy logs for denied requests.

Firewall Rules:
- Check if there are firewall rules blocking communication between the Controller and Automation Hub.

DNS Issues:
- Ensure DNS resolution works correctly from Controller to Automation Hub:

nslookup <automation_hub_host>

Network Configuration:
- Check for changes in network configurations that might have affected communication.

Automation Hub Health:
- Check that Automation Hub’s services and processes are running and healthy.

Database Connectivity for Automation Hub:
- Ensure the database backend for Automation Hub (Pulp) can connect without issues.

Controller Version Compatibility:
- Ensure Controller and Automation Hub versions are compatible.

Updates/Patches:
- Check for any updates or patches that might address communication issues.

SAML authentication issues between Ansible Controller and Microsoft Azure AD

Configuration Check:
- Ensure that the configuration on both Ansible Controller and Azure AD side matches. This includes Entity IDs, Assertion Consumer Service URLs, and other key SAML attributes.
Certificate Validation:
- Ensure the certificate used for signing the SAML assertion in Azure AD is still valid and has not expired. Note that on the AAP side, the certificate and key in use are tower.cert and tower.key.
- The same certificate should be configured in Ansible Controller’s SAML settings.
- If you’re using encrypted assertions, make sure you’ve provided the correct
Assertion Content:
- Using tools like SAML Tracer for Firefox or the SAML Chrome Panel for Chrome can help you capture the SAML assertion sent from Azure AD to Ansible Controller.
- Check if the attributes in the assertion match what’s expected by Ansible Controller.
Azure AD Configuration:
- Ensure that the user trying to authenticate is part of the Azure AD user group that’s allowed to log into Ansible Controller.
- Confirm that the SAML configuration on the Azure AD side, especially the claim rules, is correctly set up to provide the necessary claims to Ansible Controller.
Endpoint URLs:
- Ensure that the SSO URL and Entity ID on both Azure AD and Ansible Controller match. Any discrepancies here will cause authentication to fail.
Clock Skew:
- SAML assertions are time sensitive. Ensure that the system clocks on both Ansible Controller and Azure AD (Azure’s end will typically be accurate) are synchronised.
Role and Attribute Mapping:
- In Ansible Controller’s SAML settings, ensure that you’ve correctly mapped Azure AD attributes to Controller’s user attributes and roles.
Network Issues:
- Confirm there are no network issues preventing Ansible Controller from reaching Azure AD or vice versa.
Session and Cookie Issues:
- Clear cookies and session data in your browser or try a different browser to rule out any session-specific issues.
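
For the clock skew check above, the controller’s offset from its time source can be read from chronyc tracking. The sketch below assumes that chronyd is the time service on the controller (the RHEL default) and that the output contains a line of the form "System time : N seconds fast|slow of NTP time"; clock_offset_ok is a hypothetical helper, and the one-second limit in the example is an arbitrary choice.

```shell
# Succeed if the absolute system clock offset reported by `chronyc tracking`
# (read on stdin) is at most MAX seconds. Fails if the line is missing.
# Usage: chronyc tracking | clock_offset_ok MAX
clock_offset_ok() {
    awk -v max="$1" -F': *' '
        /^System time/ {
            split($2, f, " ")          # f[1] is the offset in seconds
            found = 1
            ok = (f[1] + 0 <= max + 0)
        }
        END { exit !(found && ok) }
    '
}

# Example:
#   chronyc tracking | clock_offset_ok 1 || echo "clock offset exceeds 1 second"
```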

AWX-MANAGE Tool

Configuration:
- View current AAP configuration:

awx-manage print_settings

Check for pending migrations. This command can be used after an upgrade to verify the progress of the database migrations:

awx-manage showmigrations

Database:
- Check the database (whether it responds, its version, etc.):

awx-manage check_db

Clean up old job history (this removes old jobs and frees up space):

awx-manage cleanup_jobs

Verify the instances (controller nodes and execution nodes):

awx-manage list_instances

AWX Tool

Remember to add the following after each command:

--conf.host https://aap_fqdn --conf.username admin --conf.password 'password' --conf.insecure
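
Typing these options after every command is error-prone, so a small wrapper function can append them. This is a sketch only: awxc and the AAP_HOST, AAP_USER, and AAP_PASSWORD variable names are hypothetical, chosen for this example. Note that a password passed on the command line is visible in the process list, and that --conf.insecure disables certificate verification.

```shell
# Run an awx CLI command with the connection options appended.
# Expects AAP_HOST, AAP_USER and AAP_PASSWORD to be set in the shell
# (hypothetical variable names used only by this wrapper).
awxc() {
    awx "$@" \
        --conf.host "$AAP_HOST" \
        --conf.username "$AAP_USER" \
        --conf.password "$AAP_PASSWORD" \
        --conf.insecure
}

# Example:
#   AAP_HOST=https://aap_fqdn AAP_USER=admin
#   read -r -s -p 'Password: ' AAP_PASSWORD
#   awxc jobs list
```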

Get Configuration:

awx config

List All Jobs:

awx jobs list

Retrieve Details of a Specific Job:

awx jobs get <job_id>

List All Projects:

awx projects list

Retrieve Details of a Specific Project:

awx projects get <project_id>

List All Inventories:

awx inventories list

Retrieve Details of a Specific Inventory:

awx inventories get <inventory_id>

List All Hosts within an Inventory:

awx hosts list --inventory <inventory_id>

Ad-hoc Commands:

awx ad_hoc_commands create --inventory <inventory_id> --module_name <module_name> --module_args "<module_arguments>"

List All Job Templates:

awx job_templates list

Launch a Job Template:

awx job_templates launch <template_id> --monitor

List All Users:

awx users list

Retrieve Details of a Specific User:

awx users get <user_id>

Create a New User:

awx users create --username <username> --password <password> --email <email>

List All Organisations:

awx organizations list

Ping the AWX API:

awx ping

Ansible Private Automation Hub

Check Pulp Worker Status: Use systemctl to check the status of the Pulp workers.

systemctl status 'pulpcore-worker@*.service'

Review Pulp Worker Logs: The logs can give insight into any issues the workers might be facing.

journalctl -u 'pulpcore-worker@*.service'

Restart Pulp Workers: If a worker seems to be stuck or malfunctioning, you can restart it.

systemctl restart 'pulpcore-worker@*.service'

Update Artifact Checksums: If the allowed checksum algorithms have changed, this command recalculates missing artifact checksums and removes forbidden ones.

pulpcore-manager handle-artifact-checksums

Database Issues: Check the status of the PostgreSQL database that Pulp uses. If there are connectivity issues, Pulp tasks will fail.

Disk Space: Ensure that there’s enough disk space where Pulp stores its content and artefacts. Running out of space can cause tasks to fail.

Check the Number of Workers: If you’re dealing with a large number of tasks or large-sized content, you might need to scale up the number of Pulp workers.

SELinux: SELinux policy denials can interfere with the operation of services, including Pulp. Check for any relevant AVC denials.

ausearch -m AVC -ts recent

Check Connectivity to External Repositories: If you’re having issues syncing from external repositories, ensure network connectivity, and that firewalls or proxies aren’t blocking the connection.
