This page is a quick reference: a list of run-anywhere (copy/paste) Splunk searches for exploring the data in the OpenMethods Splunk events. Some advanced techniques are used intentionally; the foundations behind them will be covered in other articles.
1. Basic Internal Structure
In this section, we’ll tackle Splunk’s own internal structure and how it manages indexes, sources, storage, event sizes, and types, and we’ll spot check the indexes _introspection and _internal.
1.1 What is the list of all indexes even if there is zero data in the index?
...
Code Block |
---|
| eventcount summarize=false index=* index=_*
| dedup index
| rename count as countofevents
| fields index countofevents
| sort countofevents DESC |
...
Note: to conserve space on this page for the remaining queries,
- When readability matters, it is a best practice to author Splunk queries as above, with the pipe symbol as the first character of each new line. The remaining queries on this page are compacted instead.
- Sample results will either be omitted or be converted to JSON and shown as a code block like the one below (the same result set as above).
Code Block | ||
---|---|---|
| ||
[{"index":"main","countofevents":"1905405642"},{"index":"_audit","countofevents":"23781423"},{"index":"_internal","countofevents":"10004574"},{"index":"_introspection","countofevents":"1755528"},{"index":"summary","countofevents":"85111"},{"index":"_telemetry","countofevents":"24343"},{"index":"lastchanceindex","countofevents":"4"},{"index":"_thefishbucket","countofevents":"0"},{"index":"demomatricindex","countofevents":"0"},{"index":"history","countofevents":"0"},{"index":"netaddinsimport","countofevents":"0"},{"index":"popflowscriptindex","countofevents":"0"},{"index":"tunnelmetricsindex","countofevents":"0"}] |
list of data size, usage, average no. of bytes, events, event size of _raw and license index
2. Basic OpenMethods Topology
In this section, we’ll discover broadly where/how to look for events that give an overall view of how our software is deployed and being used.
2.0 I am not sure what I am looking for, how do I just explore Splunk data?
Code Block |
---|
index="main" earliest="-30m" source="mediabar" |
2.1 List of customers and their CTI environment/location of agents (or at least the URL agents use to access HIS)?
Code Block |
---|
index="main" earliest="-4h"
| stats values(network.his.uri) as hisurl, values(network.his.model) as hiscti by crm.customer |
2.2 List of customers and their sites/versions?
Code Block |
---|
| tstats count WHERE (index="main" earliest=-5d@d latest=now source="mediabar") BY crm.customer mb.version network.appManagerDomain network.crmHost
| sort mb.version DESC | dedup network.crmHost
| table crm.customer mb.version network.crmHost | sort crm.customer |
2.3 How to segment agent usage by production versus lower environments?
Code Block |
---|
| tstats count WHERE (index="main" earliest="-1h" [|inputlookup spl-customer-host.csv | where cloudenv="prod"
| fields displaycustomer hostlookup
| lookup spl-customer-host.csv displaycustomer cloudenv OUTPUT hostlookup
| fields - displaycustomer | rename hostlookup as host | format]) BY _time crm.username host
| stats dc(crm.username) as total_User by host |
The Splunk ‘lookup’ data structure that made the above query possible:
Code Block |
---|
| inputlookup spl-customer-host.csv
| WHERE NOT (displaycustomer in ("omdemo","omdev", "omqa", "omtrain"))
| dedup displaycustomer
| lookup spl-customer-host.csv displaycustomer OUTPUT crmcustomer cloudenv hostlookup |
The high water mark of unique agent logins across production:
Code Block |
---|
| tstats distinct_count(crm.username) as dc1 WHERE (index="main" earliest="7/1/2020:00:00:00" latest="8/1/2020:00:00:00"
[|inputlookup spl-customer-host.csv | where cloudenv="prod" | fields displaycustomer hostlookup
| lookup spl-customer-host.csv displaycustomer cloudenv OUTPUT hostlookup
| fields - displaycustomer | rename hostlookup as host | format]) BY _time host span=1d
| timechart sum(dc1) as all_agents_prod_by_day span=1d
| stats max(all_agents_prod_by_day) as all_agents_prod |
2.4 How to convert Splunk events to look like basic log statements I am used to for troubleshooting?
Code Block |
---|
| tstats count WHERE (index="main" earliest="8/1/2020:06:00:00" latest="8/1/2020:06:30:00" source="mediabar" host="https://chewy.custhelp.com" )
BY _time logLevel crm.instanceId crm.groupId crm.id mb.className mb.functionName message span=1s |
Or, to simplify down to a few meaningful fields and one agent (when we don’t know which agent). If we know the agent id, the sub search can be removed; a sub search can be a performance issue in some cases.
Code Block |
---|
| tstats count WHERE (index="main" earliest="-24h" host="https://faq.arval.it"
[ | tstats count WHERE (index="main" earliest="-24h" host="https://faq.arval.it") BY crm.id | sort - count | head 1 | fields crm.id | format]
) BY _time logLevel crm.id mb.className mb.functionName message span=1s
| eval class='mb.className' . "-" . 'mb.functionName' | search class="*" | table _time logLevel crm.id class message |
2.5 How to identify, at a high level, the major components in use by the customer?
A quick glance at the component name alone is enough to determine whether an agent is getting screen pops from Harmony or another way.
2. Omis Events
How to identify customer/agent using HIS/Harmony stack and how are they using it?
3. Popflow Events
How to identify customer/agent using Popflow and how they are using it?
...
...
...
Conventions
To conserve space on this page for queries and query results:
- When readability matters, it is a best practice to author Splunk queries with the pipe symbol as the first character of each new line. The queries on this page are compacted instead.
- Search command code blocks are in gray; search results are in light blue.
- Sample results may be omitted, included as screenshots, or converted to JSON and shown as a code block (for example, the result set for 1.1 below).
Section 1: Foundations
1. Basic Internal Structure of Splunk
In this section, we’ll glance at Splunk’s own internal structure and how it manages indexes, sources, storage, event sizes, and types, and we’ll spot check the indexes _introspection and _internal.
1.1. What are the Splunk indexes where OM data is stored (even if there is currently zero data in the index)?
- index="main": the default index, where all OM data lives.
- index="demomatricsindex" (or ‘netaddinsimport’, ‘popflowscriptindex’, ‘tunnelmetricsindex’): indexes created by OM for targeted research/projects.
- All the remaining indexes are Splunk internal. More queries will be added to this section as time permits, but for now the focus shifts to OM-specific data.
2. Basic OpenMethods Topology
In this section, we’ll discover broadly where/how to look for events that give an overall view of how our software is deployed and being used.
2.0. I am not sure what I am looking for, how do I just explore Splunk data?
Code Block | ||
---|---|---|
| ||
index="main" earliest="-30m" source="mediabar" |
Per the Splunk Search App Primer, after running this query the data can be explored by looking at the search app results and exploring fields in the Fields sidebar.
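As a complement to browsing the Fields sidebar, the same exploration can be sketched as a search. This is a minimal example that assumes the same index, source, and time window as 2.0; it uses the standard `fieldsummary` command to list each field with its event count, distinct value count, and a few sample values.

```spl
index="main" earliest="-30m" source="mediabar"
| fieldsummary maxvals=5
| table field count distinct_count values
| sort - count
```

Fields near the top of the output appear in the most events, which is usually a reasonable place to start exploring.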
2.1. List of customers and their CTI environment/location of agents (or at least the URL agents use to access HIS)?
Code Block | ||
---|---|---|
| ||
index="main" earliest="-4h"
| stats values(network.his.uri) as hisurl, values(network.his.model) as hiscti by crm.customer |
Results (JSON + screenshot): (blank HIS URL would imply the site hasn’t been used in the given time window or it is Popflow-only)
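To isolate the customers with a blank HIS URL, one option is to filter the result of the 2.1 query on a null `hisurl`. This is a sketch built only from the fields already used above; whether a blank value really means "not used in the window" or "Popflow-only" still has to be judged per customer.

```spl
index="main" earliest="-4h"
| stats values(network.his.uri) as hisurl, values(network.his.model) as hiscti by crm.customer
| where isnull(hisurl)
| table crm.customer hiscti
```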
2.1.a. What are agent states for UCCE and their stats?
2.2. List of customers and their sites/versions?
Code Block |
---|
| tstats count WHERE (index="main" earliest=-5d@d latest=now source="mediabar") BY crm.customer mb.version network.appManagerDomain network.crmHost
| sort mb.version DESC | dedup network.crmHost
| table crm.customer mb.version network.crmHost | sort crm.customer |
2.3. How to segment agent usage by production versus lower environments?
This is also the simplest form of unique agent logins by host (customer URL).
Code Block |
---|
| tstats distinct_count(crm.username) as agents_dc_per_h WHERE (index="main" earliest="-1h@h"
[|inputlookup spl-customer-host.csv | where cloudenv="prod" | fields displaycustomer hostlookup
| lookup spl-customer-host.csv displaycustomer cloudenv OUTPUT hostlookup
| fields - displaycustomer | rename hostlookup as host | format]) BY _time host span=1h |
2.3.a. The Splunk ‘lookup’ data structure that made the above query possible:
Code Block |
---|
| inputlookup spl-customer-host.csv
| WHERE NOT (displaycustomer in ("omdemo","omdev", "omqa", "omtrain"))
| dedup displaycustomer
| lookup spl-customer-host.csv displaycustomer OUTPUT crmcustomer cloudenv hostlookup |
2.4. How to convert Splunk events to look like the regular HIS/CS log statements I am used to?
Code Block |
---|
| tstats count WHERE (index="main" earliest="8/1/2020:06:00:00" latest="8/1/2020:06:30:00" source="mediabar" host="https://chewy.custhelp.com" )
BY _time logLevel crm.instanceId crm.groupId crm.id mb.className mb.functionName message span=1s |
2.4.a. Simplify Log-Style Statements to Fewer Fields
Or, to simplify the log-style statements above down to a few meaningful fields and one agent (for this case, let’s say we don’t know which agent, so we use the ‘top’ agent). If we know the agent id, the sub-search (which starts with the left bracket '[') can be removed. In practice, a sub-search is usually a performance hit, and in almost any search it can be avoided by restructuring.
Code Block |
---|
| tstats count WHERE (index="main" earliest="-24h" host="https://faq.arval.it"
[ | tstats count WHERE (index="main" earliest="-24h" host="https://faq.arval.it") BY crm.id | sort - count | head 1 | fields crm.id | format]
) BY _time logLevel crm.id mb.className mb.functionName message span=1s
| eval class='mb.className' . "-" . 'mb.functionName' | search class="*" | table _time logLevel crm.id class message |
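When the agent id is already known, the sub-search can be dropped and the id placed directly in the WHERE clause. The sketch below shows that variant; `HYPOTHETICAL_AGENT_ID` is a placeholder, not a real value, and must be replaced with an actual `crm.id`.

```spl
| tstats count WHERE (index="main" earliest="-24h" host="https://faq.arval.it" crm.id="HYPOTHETICAL_AGENT_ID")
    BY _time logLevel crm.id mb.className mb.functionName message span=1s
| eval class='mb.className' . "-" . 'mb.functionName'
| search class="*"
| table _time logLevel crm.id class message
```

With the sub-search gone, Splunk runs a single `tstats` pass instead of two, which is the restructuring the text above alludes to.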
2.5. How to identify, at a high level, the major components in use by the customer?
Currently, the majority of searches are centered around the component names, ‘mb.className’ and ‘mb.functionName’, and string matching. For example, a quick glance at a component name is enough to determine whether an agent is getting screen pops from Harmony or another way.
2.5.a. Component names by version
Logging design is still undergoing changes, so the component names can vary by version.
Code Block | ||
---|---|---|
| ||
| tstats count WHERE (index="main" earliest="8/17/2020:08:00:00" latest="8/18/2020:08:00:00" source="mediabar") BY mb.version mb.className mb.functionName
| eval major=mvindex(split('mb.version', "."), 0) | eval class='mb.className' . "-" . 'mb.functionName' | search class="*"
| stats values(class) as lc, count(class) as cc by major |
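The major-version extraction in the query above can be tried in isolation with a run-anywhere search. This is only an illustration: the three version strings are made-up samples, not actual `mb.version` values.

```spl
| makeresults count=3
| streamstats count as n
| eval sample_version=case(n=1, "7.0.3", n=2, "7.2.1", n=3, "8.0.0")
| eval major=mvindex(split(sample_version, "."), 0)
| stats values(sample_version) as versions by major
```

`split` turns the version string into a multivalue field on the "." separator, and `mvindex(..., 0)` keeps the first element, so versions group under their major number.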
3. Popflow Events
3.1. How to identify customer/agent using Popflow and how they are using it, aka Popflow Overview?
Code Block |
---|
| tstats count WHERE (index="main" earliest="8/17/2020:08:00:00" latest="8/18/2020:00:00:00" "mb.className"=PopflowRuntimeService ((message="*Event '*' *ed") OR (message="*Activity complete*") OR (message="*Starting Activity*") OR (message="*Activity event*") OR (message="*Got*popflow*" AND message!="*Got* 0*") OR (message="*Getting*"))) BY _time logLevel crm.customer crm.instanceId crm.groupId crm.id mb.className mb.functionName message span=1s
| rex field=message "^(?<mytitle>[^{\n]*)(?P<myjson>{.*})"
| eval jsonctx=substr(myjson, 1, 40), msgctx=substr(message, 1, 40)
| eval class='mb.className' . "-" . 'mb.functionName', crmgroup='crm.instanceId' . "-" . 'crm.groupId'
| search crmgroup="*" class="*"
| table _time logLevel crm.customer crmgroup crm.id class mytitle jsonctx msgctx |
Explanation:
a) Why the use of the ‘| search crmgroup="*" class="*"’ clause and all the string matching?
i) As described previously on this page, we are still dependent on string matching and class names. Writing fixed data points or metrics will be a better interface.
ii) The field ‘msgctx’ is present for context and would be used when we are not filtering on ‘mb.className’. The query tries to populate the ‘mytitle’ and ‘jsonctx’ fields; if they are blank, there may be a message format that was not expected, so the parsing isn’t working on it. Finally, collapsing 2 fields down to 1 simply saves space, so the ‘message’ field stays visible without scrolling.
b) One of the most important statements in this query is the use of regular expressions (pattern matching):
| rex field=message "^(?<mytitle>[^{\n]*)(?P<myjson>{.*})"
There is a page dedicated to tools for pattern matching and JSON manipulation in Splunk; keep checking back for updates.
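The effect of the `rex` statement can be checked with a run-anywhere search. This is a minimal sketch: the sample message and its JSON keys are assumptions made up for illustration, and the `spath` step is an optional extra that is not part of the original query.

```spl
| makeresults
| eval message="Starting Activity {\"typeid\": 12, \"displayName\": \"sample\"}"
| rex field=message "^(?<mytitle>[^{\n]*)(?<myjson>{.*})"
| spath input=myjson path=displayName output=displayName
| table message mytitle myjson displayName
```

The first capture group takes everything up to the first left brace as ‘mytitle’, and the second takes the braces and everything between them as ‘myjson’, which `spath` can then read as JSON.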
...
Info |
---|
From here, we are going to keep building upon the Popflow Overview, extract some new information, until we have a fully populated breakdown of the events. |
Workflows are authored to act on events. An "event detected" message can have a corresponding action to fetch a workflow, logged as a "getting popflow for eventId" message, followed by a "got popflow" message when the workflow is loaded. The workflow then starts to run activities of different types, tracked by "starting activity" and "activity complete" messages.
...
3.2. What Popflow Events are Being Triggered and are the Most Frequent?
3.3. What Popflow Activities are Being Run?
In overall product usage tracking, I like to track workflows being run and the number of instructions (aka Activities) as an overall indicator of scale and volume. But let’s start with an Activity overview in a log-format style.
From Fields Panel, click on ‘custom.displayName’ for Top 10 Values |
3.4. Start Normalizing the Data, Put Events, Popflow Scripts, and Activities All Together in Context
What did we add over the previous queries?
a) The 2 or 3 ‘rex’ commands are now handled in one ‘rex’ command.
b) We extracted ‘eventId’ by string parsing of the ‘message’ field, extracted ‘typeid’ (activity type id) from the JSON, and then used a lookup table to translate them to friendly names.
c) Multiple ‘eval’ commands were moved into a single pipe, as there is overhead for each pipe.
d) There is no single normalized field common to all event types (which makes it difficult to manipulate and combine the data later), so we added ‘msgtype’.
e) The search patterns on the ‘message’ field in the very first segment of the search were made more explicit and ordered by frequency of occurrence. When Splunk finds a match in a pipe it stops processing the rest, so this ordering gives a higher chance that Splunk finds a match early and does less processing.
Note: the technique for finding the frequency of occurrence of ‘message’ values is the same one used elsewhere on this page, which goes something like: '<your search> | stats count(msghdr) as cntmsghdr by msghdr | sort cntmsghdr DESC'
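The frequency-of-occurrence technique from the note can be sketched as a run-anywhere search. The six sample messages are made up for illustration, and deriving ‘msghdr’ as the text before the first left brace is an assumption borrowed from the ‘rex’ pattern in 3.1, not necessarily how the original query builds it.

```spl
| makeresults count=6
| streamstats count as n
| eval message=case(n<=3, "Starting Activity {\"id\": 1}", n<=5, "Activity complete {\"id\": 1}", true(), "Event 'sample' detected")
| rex field=message "^(?<msghdr>[^{\n]*)"
| eval msghdr=trim(msghdr)
| stats count as cntmsghdr by msghdr
| sort - cntmsghdr
```

The resulting order, most frequent header first, is the order in which the match patterns would be listed in the main search.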
4. Omis Events
4.1. How to identify customer/agent using HIS/Harmony stack and how are they using it, aka Omis Overview?
4.2. What are all the possible Omis message types and how do I work with them?
4.3. How do I check if there are any Omis message types I don't know about?
Previously on this page, it was stated that for a long evaluation or conditional command (for example, a string match), Splunk takes the first match and stops processing. In theory, ordering the match conditions by frequency of occurrence would therefore reduce processing and improve performance.
Leveraging that concept produced no immediately obvious performance gain, but it had a useful side effect: a search that verifies the query handles every message type, because for any unknown type certain fields would be null. A similar approach could possibly be used to uniquely identify every Omis ERROR across every CTI platform and customer.
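The null-field check can be illustrated with a run-anywhere search. This is a sketch under stated assumptions: the four sample messages and the three ‘msgtype’ labels are hypothetical stand-ins, not real Omis message formats. A `case()` with no default branch returns null when nothing matches, so filtering on `isnull(msgtype)` surfaces the messages the query does not know about.

```spl
| makeresults count=4
| streamstats count as n
| eval message=case(n=1, "TypeA sample text", n=2, "TypeB sample text", n=3, "TypeC sample text", n=4, "Unrecognized sample text")
| eval msgtype=case(like(message, "TypeA%"), "type_a", like(message, "TypeB%"), "type_b", like(message, "TypeC%"), "type_c")
| where isnull(msgtype)
| stats count by message
```

An empty result means every message matched a known pattern; any rows returned are candidates for a new pattern in the `case()` list.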