Auto-Triage Troubleshooting

Thanks to Ido Raday for this amazing troubleshooting guide!

Introduction

This is an interactive guide for debugging the auto triage process.

There are several requirements:

  • Access to the machine

  • Basic knowledge of shell commands

For any questions, please reach out to the server team.

Was the issue created?

The Auto triage process is triggered by the corresponding issue.

Before we debug the auto triage we first need to make sure that a new issue was created.

We also need to keep the alert_id for future steps.

  • Open the issues page and find the correct issue.

  • Click on the relevant issue.

  • In the relevant issue, go to Overview.

  • On the issue page, copy the alert_id from the address bar.
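If you prefer not to pick the alert_id out of the address bar by hand, the following is a minimal sketch that pulls it out of a pasted URL. It assumes the alert_id is UUID-shaped (like the example ID used later in this guide); the function name extract_alert_id and the example URL are hypothetical, and the real URL layout is not specified here.

```shell
# Print the first UUID-shaped token found in the given text.
# Assumption: the alert_id is a UUID; the surrounding URL layout is not relied on.
extract_alert_id() (
    printf '%s\n' "$1" |
        grep -oE '[0-9a-fA-F]{8}(-[0-9a-fA-F]{4}){3}-[0-9a-fA-F]{12}' |
        head -n 1
)

# Example (hypothetical URL):
#   extract_alert_id 'https://indeni.example/issues/0c7968cd-8df9-4272-852a-ebd3bea2b130/overview'
```

If nothing is printed, the pasted text did not contain a UUID-shaped value and the alert_id has to be copied manually.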



  • If the issue was created - goto “Does the issue have Auto triage details?” section

  • If the issue was not created - goto “The issue was not created!” section

Does the issue have Auto triage details?

You can identify an issue with auto triage by the symbol next to the headline

  • If the Auto triage symbol appeared - Great! Close the document

  • If the Auto triage symbol did not appear - goto “Verify the Auto triage process result” section

The issue was not created!

If the issue was not created it can be due to several reasons:

  • There is already an active issue on the device

  • There is some issue with the data collection

  • There is some issue with the rule

Make sure to find the reason, and continue to the next step only after fixing it.

 

  • If the issue was created - goto “Does the issue have Auto triage details?” section

  • If you cannot find the problem - goto “Open ticket for Server team” section

 

 

Verify the Auto triage process result

Every issue should trigger an event in the Auto triage service.

The Auto triage service runs the corresponding playbook (if available) and sends the result back to the server.

This step checks the result the server got from the playbook.

  • Enter psql and run the following commands (replace ALERT_ID with the correct one):

 

    \x
    select * from automation_job where alert_id = 'ALERT_ID';
  • Copy the job_id as you will need it for the next steps
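The same lookup can also be run from the shell without opening an interactive psql session. This is a minimal sketch: the helper name automation_job_query is hypothetical, and it assumes the alert_id is a 36-character UUID-style value, so anything else is rejected before it is placed into the SQL text.

```shell
# Build the lookup query for one alert.
# Assumption: alert_id is UUID-shaped (36 characters of hex digits and dashes).
automation_job_query() (
    alert_id=$1
    if ! printf '%s' "$alert_id" | grep -qE '^[0-9a-fA-F-]{36}$'; then
        echo "not a valid alert_id: $alert_id" >&2
        return 1
    fi
    printf "select * from automation_job where alert_id = '%s';\n" "$alert_id"
)

# Usage on the machine (-x gives the same expanded output as \x):
#   psql -x -c "$(automation_job_query 0c7968cd-8df9-4272-852a-ebd3bea2b130)"
```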



  • If the query returned an empty result - goto “Is Auto Triage feature enabled?”

  • If the query returned a success result - goto “Auto triage process result Was a success”

  • If the query returned a Not_Available result - goto “Auto triage process result Was a Not_Available”

  • If the query returned error code FILE_IS_MISSING - goto “Auto triage process failed with error code FILE_IS_MISSING”

  • If the query returned error code UNABLE_TO_GET_DEVICE_DATA - goto “Auto triage process failed with error code Unable_To_Get_Device_Data”

  • If the query returned error code INSUFFICIENT_DATA_ACQUIRED - goto “Auto triage process failed with error code INSUFFICIENT_DATA_ACQUIRED”

  • If the query returned error code PLAYBOOK_RUN_FAILED - goto “Auto triage process failed with error code PLAYBOOK_RUN_FAILED”
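The decision list above can be sketched as a small lookup. The function name next_section is hypothetical. It takes the result or error code you read from the query output (pass an empty string for an empty result) and prints the section to continue with. The input is upper-cased first, on the assumption that the exact casing stored in the database may differ from how the values are spelled in this guide.

```shell
# Map a query result / error code to the next section of this guide.
# Assumption: matching is case-insensitive; unknown values go to the server team.
next_section() (
    result=$(printf '%s' "$1" | tr '[:lower:]' '[:upper:]')
    case "$result" in
        "")
            echo "Is Auto Triage feature enabled?" ;;
        SUCCESS)
            echo "Auto triage process result Was a success" ;;
        NOT_AVAILABLE)
            echo "Auto triage process result Was a Not_Available" ;;
        FILE_IS_MISSING)
            echo "Auto triage process failed with error code FILE_IS_MISSING" ;;
        UNABLE_TO_GET_DEVICE_DATA)
            echo "Auto triage process failed with error code Unable_To_Get_Device_Data" ;;
        INSUFFICIENT_DATA_ACQUIRED)
            echo "Auto triage process failed with error code INSUFFICIENT_DATA_ACQUIRED" ;;
        PLAYBOOK_RUN_FAILED)
            echo "Auto triage process failed with error code PLAYBOOK_RUN_FAILED" ;;
        *)
            echo "Open ticket for Server team" ;;
    esac
)
```

For example, next_section Not_Available prints the name of the Not_Available section.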

Is Auto Triage feature enabled?

It seems like the auto triage process did not run.

There are two ways to enable the Auto triage feature:

  • Server Application.conf (not recommended - this config is removed on every install)

    • The path for the file is: /usr/share/indeni/conf/application.conf

    • The flag name is enable-automation-process

    • Using the following command you can check the flag status:
      grep enable-automation-process /usr/share/indeni/conf/application.conf

  • Psql db

    • Enter the db by simply writing psql

    • Using the following command you can check the flag status:
      select * from configuration where key ='automation.enabled';

    • If no row is returned - the feature is disabled

  • If the feature is disabled in both ways - goto “Enable the feature flag”

  • If the feature is enabled at least in one way - goto “Open ticket for Server team”
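The config-file check above can be scripted as a yes/no test. This is a sketch under assumptions: the helper name conf_flag_enabled is hypothetical, and it assumes the flag is written as enable-automation-process = true (or with ":" instead of "="), which this guide does not confirm. Lines starting with # or // are treated as comments and ignored.

```shell
# Exit status 0 if the flag is set to true in the given config file, 1 otherwise.
# Assumption: the flag appears as "enable-automation-process = true" (or with ':').
conf_flag_enabled() (
    conf_file=${1:-/usr/share/indeni/conf/application.conf}
    grep -Ev '^[[:space:]]*(#|//)' "$conf_file" 2>/dev/null |
        grep -Eq 'enable-automation-process[[:space:]]*[=:][[:space:]]*"?true'
)

# Usage on the machine:
#   conf_flag_enabled && echo "enabled in application.conf"
# Database side (prints the stored value, or nothing if the row is missing):
#   psql -At -c "select value from configuration where key = 'automation.enabled';"
```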

 

Enable the feature flag

  • Server Application.conf (not recommended)

    • Open /usr/share/indeni/conf/application.conf in vi

    • Change the value next to enable-automation-process from false to true

    • Restart the server using imanage → 3

  • psql db

    • Enter the db by simply writing psql

    • Using the following commands you can enable the flag:

      • delete from configuration where key ='automation.enabled';

      • insert into configuration (key,value) values ('automation.enabled','true');
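Since the delete and the insert belong together, one option is to run them as a single transaction so the flag is never left missing if the second statement fails. The sketch below only prints the SQL; the function name enable_automation_sql is hypothetical, and wrapping the two statements in begin/commit is a suggestion of this sketch, not something the server requires.

```shell
# Print the two statements from the steps above wrapped in one transaction.
enable_automation_sql() {
    cat <<'SQL'
begin;
delete from configuration where key = 'automation.enabled';
insert into configuration (key, value) values ('automation.enabled', 'true');
commit;
SQL
}

# Usage on the machine (stop at the first error instead of continuing):
#   enable_automation_sql | psql -v ON_ERROR_STOP=1
```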



  • After creating a new alert - goto “Open ticket for Server team”



Auto triage process result Was a success

It seems like the auto triage process worked on the server-side.

If the issue does not show the alert, it may be due to a UI issue. Please open a ticket with the application team.



Auto triage process result Was a Not_Available

It seems like the auto triage process did not find a playbook to run for the issue.

Search the triage process log for the alert_id, for example:

grep -A20 0c7968cd-8df9-4272-852a-ebd3bea2b130 /usr/share/indeni-services/logs/automation.log

Find the log block related to the alert_id, for example:

2019-10-05 19:53:38,263 - INFO - automation_registration.py - New automation request, alert_id: 0c7968cd-8df9-4272-852a-ebd3bea2b130, device_id: 270d7888-ede5-419d-b968-ab45c8a08c07, rule_name: DeviceMonitoringSuspended, vendor_name: paloaltonetworks
2019-10-05 19:53:38,264 - INFO - playbook_catalog.py - Get playbook for rule: DeviceMonitoringSuspended  vendor: paloaltonetworks
2019-10-05 19:53:38,264 - INFO - playbook_catalog.py - playbook for rule: DeviceMonitoringSuspended  vendor: paloaltonetworks is None
2019-10-05 19:53:38,264 - INFO - automation_registration.py - New job created, job_id: d8dad379-c130-4135-9c1a-35ce52fd201d, alert_id: 0c7968cd-8df9-4272-852a-ebd3bea2b130, device_id: 270d7888-ede5-419d-b968-ab45c8a08c07, playbook_file: None
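To see at a glance which job was created and which playbook file (if any) was chosen, the "New job created" entry can be reduced to its two relevant fields. This is a minimal sketch: the helper name job_info_for_alert is hypothetical, and it assumes the log entries keep the field layout shown in the example above.

```shell
# Print "job_id=... playbook_file=..." for every job created for the given alert.
# Assumption: log entries follow the "New job created, job_id: ..., playbook_file: ..." layout.
job_info_for_alert() (
    alert_id=$1
    log_file=${2:-/usr/share/indeni-services/logs/automation.log}
    grep -F "alert_id: $alert_id" "$log_file" |
        grep -F 'New job created' |
        sed -E 's/.*job_id: ([^,]+),.*playbook_file: (.*)$/job_id=\1 playbook_file=\2/'
)

# Usage on the machine:
#   job_info_for_alert 0c7968cd-8df9-4272-852a-ebd3bea2b130
```

A playbook_file of None matches the Not_Available case: no playbook was found for the rule and vendor.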

 

If the data is not as expected - goto “Open ticket for Server team”

Auto triage process failed with error code FILE_IS_MISSING

The error FILE_IS_MISSING indicates that the playbook file is missing.

  • Open the catalog:
    less /usr/share/indeni-knowledge/stable/automation/playbooks/playbook_catalog.yaml

  • Find the playbook that matches the issue.

  • Make sure the playbook file exists
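The existence check can be roughly automated. The sketch below is built on assumptions that this guide does not confirm: that the catalog mentions playbook files by name with a .yml or .yaml extension, and that those names are relative to the directory containing the catalog. The function name missing_playbooks is hypothetical. If the catalog uses a different layout, check the files manually instead.

```shell
# Print every .yml/.yaml file named in the catalog that does not exist on disk.
# Assumptions: file names appear literally in the catalog and are relative
# to the catalog's own directory.
missing_playbooks() (
    catalog=$1
    base_dir=$(dirname "$catalog")
    grep -oE '[A-Za-z0-9_./-]+\.ya?ml' "$catalog" | sort -u |
        while read -r rel; do
            [ -e "$base_dir/$rel" ] || echo "$rel"
        done
)

# Usage on the machine:
#   missing_playbooks /usr/share/indeni-knowledge/stable/automation/playbooks/playbook_catalog.yaml
```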

 

  • If the data is not displayed as expected - goto “Open ticket for Server team”

Auto triage process failed with error code Unable_To_Get_Device_Data

The error Unable_To_Get_Device_Data indicates that the Auto triage service did not manage to get the credentials from the server. This can happen, for example, when Indeni only needs SSH credentials for interrogation but the playbook requires HTTP credentials.

  • Using psql, get the last credentials used by the device:
    select * from credential where id in (select credential_id from last_used_credentials where device_id = 'YOUR DEVICE ID');

  • If the data is not as expected - goto “Open ticket for Server team”

Auto triage process failed with error code INSUFFICIENT_DATA_ACQUIRED

The error INSUFFICIENT_DATA_ACQUIRED indicates that the Auto triage service did not manage to extract enough data for the server. This error appears when:

  • The playbook conclusion is empty

  • The number of returned tasks is 0

This happens when the playbook exits unexpectedly without throwing an exception.

 

  • In order to debug the ansible process - goto “Reading the ansible logs”

Auto triage process failed with error code PLAYBOOK_RUN_FAILED

The error PLAYBOOK_RUN_FAILED indicates that the ansible process failed while running the playbook.

 

  • In order to debug the ansible process - goto “Reading the ansible logs”

Reading the ansible logs

The ansible logs are in /usr/share/indeni-services/logs/ansible

  • Enter the directory with the correct device-id

  • Enter the artifacts directory

  • Enter the directory with the correct job-id

  • Look at the ‘stdout’ file in this directory. It should contain the full output of the Ansible run, e.g.:

    /usr/share/indeni-services/logs/ansible/<device_id>/artifacts/<job_id>/stdout
  • The file usually contains the exception that caused the failure.

  • If the failure message in the log is not clear - goto “Open ticket for Server team”
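The directory walk above can be shortened with two small helpers. This is a sketch, not an official tool: the function names are hypothetical, and the search patterns are a guess at what is usually interesting (Ansible marks failed tasks with "fatal:" and "FAILED!", and Python exceptions start with "Traceback").

```shell
# Build the path of the stdout file for one device and job.
ansible_stdout_path() (
    device_id=$1
    job_id=$2
    logs_root=${3:-/usr/share/indeni-services/logs/ansible}
    printf '%s/%s/artifacts/%s/stdout\n' "$logs_root" "$device_id" "$job_id"
)

# Print the lines (with line numbers) that most likely describe the failure.
# Returns non-zero if the file is unreadable or no such line is found.
ansible_failures() (
    stdout_file=$(ansible_stdout_path "$@")
    if [ ! -r "$stdout_file" ]; then
        echo "cannot read $stdout_file" >&2
        return 1
    fi
    grep -nE 'fatal:|FAILED!|Traceback|ERROR' "$stdout_file"
)

# Usage on the machine:
#   ansible_failures <device_id> <job_id>
```

If nothing is printed, open the stdout file with less and read the full output.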

 

 

Open ticket for Server team

When opening a ticket please include:

  • Automation log file - less /usr/share/indeni-services/logs/automation.log

  • Server log file - less /usr/share/indeni/logs/indeni.log

  • Ansible artifact dir - /usr/share/indeni-services/logs/ansible

Also, make sure you describe in detail all the steps you went through and all the data you collected.
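To attach everything in one file, the logs listed above can be packed into a single archive. This is a minimal sketch: the function name collect_ticket_bundle and the output path in the usage example are hypothetical. Paths that do not exist are skipped with a warning rather than aborting the whole collection.

```shell
# Pack the given files/directories into a .tar.gz archive.
# First argument: output archive; remaining arguments: paths to include.
collect_ticket_bundle() (
    out=$1
    shift
    # Rebuild the argument list so that it contains only existing paths.
    for path in "$@"; do
        shift
        if [ -e "$path" ]; then
            set -- "$@" "$path"
        else
            echo "skipping missing path: $path" >&2
        fi
    done
    if [ "$#" -eq 0 ]; then
        echo "nothing to collect" >&2
        return 1
    fi
    tar -czf "$out" "$@"
)

# Usage on the machine:
#   collect_ticket_bundle /tmp/auto-triage-ticket.tar.gz \
#       /usr/share/indeni-services/logs/automation.log \
#       /usr/share/indeni/logs/indeni.log \
#       /usr/share/indeni-services/logs/ansible
```

The ansible directory can be large; if so, pass only the sub-directory of the relevant device-id instead.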