Auto-Triage Troubleshooting
- 1 Introduction
- 2 Was The issue created?
- 3 Does the issue have Auto triage details?
- 4 The issue was not created!
- 5 Verify the Auto triage process result
- 6 Is Auto Triage feature enabled?
- 7 Enable the feature flag
- 8 Auto triage process result Was a success
- 9 Auto triage process result Was a Not_Available
- 10 Auto triage process failed with error code FILE_IS_MISSING
- 11 Auto triage process failed with error code Unable_To_Get_Device_Data
- 12 Auto triage process failed with error code INSUFFICIENT_DATA_ACQUIRED
- 13 Auto triage process failed with error code PLAYBOOK_RUN_FAILED
- 14 Reading the ansible logs
- 15 Open ticket for Server team
Thanks to Ido Raday for this amazing troubleshooting guide!
Introduction
This is an interactive guide for debugging the auto triage process.
There are several requirements:
- Access to the machine
- Basic knowledge of shell commands
For any questions, please reach out to the server team.
Was The issue created?
The Auto triage process is triggered by the corresponding issue.
Before we debug the auto triage, we first need to make sure that a new issue was created.
We also need to keep the alert_id for future steps.
Open the issues page and find the correct issue.
Click on the relevant issue.
Go to the relevant issue → Overview.
On the issue page, copy the alert_id from the address bar.
If the issue was created - go to the “Does the issue have Auto triage details?” section
If the issue was not created - go to the “The issue was not created!” section
Does the issue have Auto triage details?
You can identify an issue with auto triage by the symbol next to the headline
If the Auto triage symbol appears - great! You can close this document.
If the Auto triage symbol does not appear - go to the “Verify the Auto triage process result” section
The issue was not created!
If the issue was not created, it can be due to several reasons:
- There is already an active issue on the device
- There is a problem with the data collection
- There is a problem with the rule
Find the reason and fix it before continuing to the next step.
If the issue was then created - go to the “Does the issue have Auto triage details?” section
If you cannot find the problem - go to the “Open ticket for Server team” section
Verify the Auto triage process result
Every issue should trigger an event in the Auto triage service.
The auto triage service runs the corresponding playbook (if available) and sends the result back to the server.
This step checks the result the server got from the playbook.
Enter psql and run the following commands (replace ALERT_ID with the correct one):
\x
select * from automation_job where alert_id = 'ALERT_ID';
Copy the job_id, as you will need it for the next steps.
If the query returned an empty result - go to the “Is Auto Triage feature enabled?” section
If the query returned a success result - go to the “Auto triage process result Was a success” section
If the query returned a Not_Available result - go to the “Auto triage process result Was a Not_Available” section
If the query returned error code FILE_IS_MISSING - go to the “Auto triage process failed with error code FILE_IS_MISSING” section
If the query returned error code UNABLE_TO_GET_DEVICE_DATA - go to the “Auto triage process failed with error code Unable_To_Get_Device_Data” section
If the query returned error code INSUFFICIENT_DATA_ACQUIRED - go to the “Auto triage process failed with error code INSUFFICIENT_DATA_ACQUIRED” section
If the query returned error code PLAYBOOK_RUN_FAILED - go to the “Auto triage process failed with error code PLAYBOOK_RUN_FAILED” section
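The branching above can be written down as a small lookup. This is only a sketch and not part of the product: the function name next_section is hypothetical, and it assumes you pass it the result or error code exactly as you read it from the automation_job row (or an empty string when the query returned nothing). Sending unknown values to the server team is also an assumption of this sketch.

```shell
# Hypothetical helper: print the section of this guide to read next,
# given the result or error code seen in the automation_job row.
# Pass an empty string when the query returned no rows.
next_section() {
    # Upper-case the input so Not_Available and NOT_AVAILABLE both match.
    result=$(printf '%s' "$1" | tr '[:lower:]' '[:upper:]')
    case "$result" in
        "")                         echo "Is Auto Triage feature enabled?" ;;
        SUCCESS)                    echo "Auto triage process result Was a success" ;;
        NOT_AVAILABLE)              echo "Auto triage process result Was a Not_Available" ;;
        FILE_IS_MISSING)            echo "Auto triage process failed with error code FILE_IS_MISSING" ;;
        UNABLE_TO_GET_DEVICE_DATA)  echo "Auto triage process failed with error code Unable_To_Get_Device_Data" ;;
        INSUFFICIENT_DATA_ACQUIRED) echo "Auto triage process failed with error code INSUFFICIENT_DATA_ACQUIRED" ;;
        PLAYBOOK_RUN_FAILED)        echo "Auto triage process failed with error code PLAYBOOK_RUN_FAILED" ;;
        # Assumption: anything this guide does not cover goes to the server team.
        *)                          echo "Open ticket for Server team" ;;
    esac
}
```

For example, next_section "PLAYBOOK_RUN_FAILED" prints the name of the PLAYBOOK_RUN_FAILED section.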
Is Auto Triage feature enabled?
It seems like the auto triage process did not run.
There are two ways to enable the Auto triage feature:
- Server Application.conf (not recommended - this config is removed on every install)
  The path for the file is: /usr/share/indeni/conf/application.conf
  The flag name is enable-automation-process
  Using the following command you can check the flag status:
  grep enable-automation-process /usr/share/indeni/conf/application.conf
- Psql db
  Enter the db by running psql
  Using the following command you can check the flag status:
  select * from configuration where key = 'automation.enabled';
  If no row is returned - the feature is disabled
If the feature is disabled in both places - go to the “Enable the feature flag” section
If the feature is enabled in at least one place - go to the “Open ticket for Server team” section
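Both checks can also be run without opening an editor or an interactive psql session. The helpers below are a sketch under stated assumptions: the function names are hypothetical, the flag is assumed to sit on its own line in the form enable-automation-process = true (or with a colon), and psql is assumed to connect with no extra arguments, as it does elsewhere in this guide.

```shell
# Hypothetical helper: succeed (exit 0) when the given config file sets
# enable-automation-process to true.
# Assumption: one setting per line, written as `key = true` or `key: true`,
# optionally quoted. Commented-out lines do not match.
conf_flag_enabled() {
    conf_file=${1:-/usr/share/indeni/conf/application.conf}
    grep -Eq '^[[:space:]]*"?enable-automation-process"?[[:space:]]*[=:][[:space:]]*"?true"?' "$conf_file"
}

# Hypothetical helper: print the value stored in the database, or nothing
# when the row does not exist (which this guide treats as disabled).
# -t prints rows only, -A removes alignment padding, -c runs one command.
db_flag_value() {
    psql -t -A -c "select value from configuration where key = 'automation.enabled';"
}
```

conf_flag_enabled returns an exit status, so it can be used directly in an if statement; db_flag_value prints the stored value so you can read it.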
Enable the feature flag
- Server Application.conf (not recommended)
  Open /usr/share/indeni/conf/application.conf in vi
  Change the value next to enable-automation-process from false to true
  Restart the server using imanage → 3
- psql db
  Enter the db by running psql
  Using the following commands you can enable the flag:
  delete from configuration where key = 'automation.enabled';
  insert into configuration (key, value) values ('automation.enabled', 'true');
After enabling the flag, create a new alert. If the auto triage still does not run for it - go to the “Open ticket for Server team” section
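If you prefer the database route, the delete and insert can be sent as one transaction so the flag is never left deleted without being re-inserted. This is a sketch only: the function name automation_enable_sql is hypothetical, and wrapping the two statements in BEGIN/COMMIT is a suggestion of this example rather than something the server requires.

```shell
# Hypothetical helper: print the two statements from this section wrapped
# in a single transaction.
automation_enable_sql() {
    cat <<'SQL'
BEGIN;
delete from configuration where key = 'automation.enabled';
insert into configuration (key, value) values ('automation.enabled', 'true');
COMMIT;
SQL
}

# Usage (this changes the live database):
#   automation_enable_sql | psql -v ON_ERROR_STOP=1
```

With ON_ERROR_STOP set, psql stops at the first failing statement, so the transaction is not committed if the delete or the insert fails.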
Auto triage process result Was a success
It seems like the auto triage process worked on the server-side.
If the issue still does not show the auto triage details, it may be due to a UI issue. Please open a ticket for the application team.
Auto triage process result Was a Not_Available
It seems like the auto triage process did not find a playbook to run for the issue.
Search the triage process log for the alert_id, for example:
grep -A20 0c7968cd-8df9-4272-852a-ebd3bea2b130 /usr/share/indeni-services/logs/automation.log
Find the log block that is related to the alert_id, for example:
2019-10-05 19:53:38,263 - INFO - automation_registration.py - New automation request, alert_id: 0c7968cd-8df9-4272-852a-ebd3bea2b130, device_id: 270d7888-ede5-419d-b968-ab45c8a08c07, rule_name: DeviceMonitoringSuspended, vendor_name: paloaltonetworks
2019-10-05 19:53:38,264 - INFO - playbook_catalog.py - Get playbook for rule: DeviceMonitoringSuspended vendor: paloaltonetworks
2019-10-05 19:53:38,264 - INFO - playbook_catalog.py - playbook for rule: DeviceMonitoringSuspended vendor: paloaltonetworks is None
2019-10-05 19:53:38,264 - INFO - automation_registration.py - New job created, job_id: d8dad379-c130-4135-9c1a-35ce52fd201d, alert_id: 0c7968cd-8df9-4272-852a-ebd3bea2b130, device_id: 270d7888-ede5-419d-b968-ab45c8a08c07, playbook_file: None
If the data is not as expected - go to the “Open ticket for Server team” section
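The same check can be scripted. The sketch below assumes the log lines look like the sample above, that is, the "New job created" line carries both the alert_id and "playbook_file: None". The function name playbook_missing_for_alert is hypothetical.

```shell
# Hypothetical helper: succeed (exit 0) when the automation log shows that
# a job was created for the given alert_id with no playbook file.
# Assumption: the log format matches the sample shown in this section.
playbook_missing_for_alert() {
    alert_id=$1
    log_file=${2:-/usr/share/indeni-services/logs/automation.log}
    # index() does a plain substring match, so no regex escaping is needed.
    awk -v id="alert_id: $alert_id" '
        index($0, id) && index($0, "playbook_file: None") { found = 1 }
        END { exit !found }
    ' "$log_file"
}
```

For example, playbook_missing_for_alert 0c7968cd-8df9-4272-852a-ebd3bea2b130 succeeds for the sample log block above.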
Auto triage process failed with error code FILE_IS_MISSING
The error FILE_IS_MISSING indicates that the playbook file is missing.
Open the catalog:
less /usr/share/indeni-knowledge/stable/automation/playbooks/playbook_catalog.yaml
Find the playbook that matches the issue.
Make sure the playbook file exists.
If the data is not as expected - go to the “Open ticket for Server team” section
Auto triage process failed with error code Unable_To_Get_Device_Data
The error Unable_To_Get_Device_Data indicates that the Auto triage service did not manage to get the credentials from the server. This can happen when, for example, indeni only needs SSH credentials for interrogation but the playbook requires HTTP credentials.
Using psql, get the last credentials used by the device (replace YOUR_DEVICE_ID with the correct one):
select * from credential where id in (select credential_id from last_used_credentials where device_id = 'YOUR_DEVICE_ID');
If the data is not as expected - go to the “Open ticket for Server team” section
Auto triage process failed with error code INSUFFICIENT_DATA_ACQUIRED
The error INSUFFICIENT_DATA_ACQUIRED indicates that the Auto triage service did not manage to extract enough data for the server. This error appears when:
- The playbook conclusion is empty
- The number of returned tasks is 0
This happens when the playbook exits unexpectedly without throwing an exception.
In order to debug the ansible process - go to the “Reading the ansible logs” section
Auto triage process failed with error code PLAYBOOK_RUN_FAILED
The error PLAYBOOK_RUN_FAILED indicates that the ansible process failed while running the playbook.
In order to debug the ansible process - go to the “Reading the ansible logs” section
Reading the ansible logs
The ansible logs are in /usr/share/indeni-services/logs/ansible
Enter the directory with the correct device_id
Enter the artifacts directory
Enter the directory with the correct job_id
Look at the ‘stdout’ file in this directory. It should contain the full output of the Ansible run, e.g.:
/usr/share/indeni-services/logs/ansible/<device_id>/artifacts/<job_id>/stdout
The file usually contains the exception that caused the failure.
If the failure message in the log is not clear - go to the “Open ticket for Server team” section
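The directory walk above can be shortened with two small helpers. They are a sketch, and the function names are hypothetical. The path layout is taken from this section, and the search relies on Ansible's default output, which marks failed tasks with "fatal:" followed by "FAILED!" or "UNREACHABLE!".

```shell
# Hypothetical helper: print the path of the Ansible stdout file for a job.
# The third argument overrides the base directory (mainly useful for testing).
ansible_stdout_path() {
    device_id=$1
    job_id=$2
    base_dir=${3:-/usr/share/indeni-services/logs/ansible}
    printf '%s/%s/artifacts/%s/stdout\n' "$base_dir" "$device_id" "$job_id"
}

# Hypothetical helper: print the failure lines of a stdout file with their
# line numbers and two lines of trailing context.
show_ansible_failures() {
    grep -n -E -A2 'fatal:|FAILED!|UNREACHABLE!' "$1"
}
```

For example, show_ansible_failures "$(ansible_stdout_path DEVICE_ID JOB_ID)" prints only the failing tasks, with DEVICE_ID and JOB_ID replaced by the values collected earlier. If it prints nothing, read the whole file.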
Open ticket for Server team
When opening a ticket please include:
- Automation log file - /usr/share/indeni-services/logs/automation.log
- Server log file - /usr/share/indeni/logs/indeni.log
- Ansible artifact dir - /usr/share/indeni-services/logs/ansible
Also, make sure you describe in detail all the steps you took and all the data you collected.
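To attach everything in one file, the three locations can be packed into a single archive. The helper below is a sketch: the function name collect_triage_bundle and the archive name in the usage example are hypothetical, and it assumes tar with gzip support is available on the machine.

```shell
# Hypothetical helper: pack the given paths into one .tar.gz archive.
# With no paths given, it uses the three locations listed in this section.
# Paths that do not exist are skipped with a warning instead of aborting.
collect_triage_bundle() {
    out_file=$1
    shift
    if [ "$#" -eq 0 ]; then
        set -- /usr/share/indeni-services/logs/automation.log \
               /usr/share/indeni/logs/indeni.log \
               /usr/share/indeni-services/logs/ansible
    fi
    # Append every existing path to the argument list, then drop the
    # original arguments so only the existing paths remain.
    count=$#
    for path in "$@"; do
        if [ -e "$path" ]; then
            set -- "$@" "$path"
        else
            echo "warning: $path not found, skipping" >&2
        fi
    done
    shift "$count"
    # Nothing to pack: report failure instead of creating an empty archive.
    [ "$#" -gt 0 ] || return 1
    tar -czf "$out_file" "$@"
}
```

For example, collect_triage_bundle /tmp/auto-triage-ticket.tar.gz creates the archive from the default locations. The ansible directory can be large, so you may prefer to pass only the directory of the relevant device_id.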