[Discussion] How to detect anomalies (errors and exceptions) in log files?
Is this a good approach?
So I’m working on a Root Cause Analysis system which should help find the cause/the root error of failed system builds (packaged in a tarball), through the analysis of log files (database logs, system logs, etc) inside it.
The only label available is whether a given tarball contains the logs of a failed system build or of a successful one.
Instead of trying to find the specific log statements which mention the root cause, I thought it would be easier to first find the log file in which these specific lines are saved. A log file containing failure log statements should be detected as an “outlier” log file, compared to the normal log files that are created from successful builds. If we find log files that are outliers, we would only have to search for the failure log statements in these few log files.
My strategy to find the anomalous log files (outliers) so far:
Consider a list of tars containing the logs of a successful build. For each tar:
- Extract the tar and consider only log files (filter/remove configuration files, cache files, etc)
- Group the log files per service (mongodb, apt, fsck, etc)
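A minimal sketch of these two steps, assuming (hypothetically) that log files end in `.log` and that the service name can be read off the file name (`mongodb.log` gives `mongodb`). The real layout of the tars may differ, so the filter and the grouping rule are placeholders:

```python
import tarfile
from collections import defaultdict
from pathlib import PurePosixPath


def group_logs_by_service(tar_path):
    """Return {service: {member_name: [log statements]}} for one tarball."""
    groups = defaultdict(dict)
    with tarfile.open(tar_path) as tar:
        for member in tar.getmembers():
            path = PurePosixPath(member.name)
            # Assumption: only regular files ending in ".log" are log files;
            # configuration files, cache files, etc. are skipped.
            if not member.isfile() or path.suffix != ".log":
                continue
            # Assumption: the service name is the file name without extension.
            service = path.stem
            raw = tar.extractfile(member).read()
            text = raw.decode("utf-8", errors="replace")
            groups[service][member.name] = text.splitlines()
    return dict(groups)
```

Calling this on every successful tar and merging the per-service dictionaries gives the input for the per-service training.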
For each service:
- Remove the timestamp from the log body for each log statement, inside each log file
- Concatenate, combine all the log statements of the same log file into a single string, let’s call it “log file content”
- Create an array containing the log file contents of each log file
- Use the TF-IDF transformer on this array (TfidfVectorizer) to create a dataframe (fit/transform)
- Create an isolation forest (sklearn.ensemble IsolationForest) or a one class SVM (sklearn.svm OneClassSVM) called model
- Fit the model on the dataframe
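The per-service training steps above could be sketched roughly like this. The timestamp pattern is an assumption (ISO-like timestamps at the start of each statement) and would need adjusting to the actual log formats; `log_file_content` and `train_service_model` are just illustrative names. Note that `TfidfVectorizer` returns a sparse matrix rather than a pandas dataframe, which `IsolationForest` accepts directly:

```python
import re

from sklearn.ensemble import IsolationForest
from sklearn.feature_extraction.text import TfidfVectorizer

# Assumption: each statement starts with a timestamp such as
# "2021-03-04 12:00:01" or "2021-03-04T12:00:01.123Z".
TIMESTAMP_RE = re.compile(
    r"^\d{4}-\d{2}-\d{2}[T ]\d{2}:\d{2}:\d{2}(?:[.,]\d+)?Z?\s*"
)


def log_file_content(lines):
    """Strip the leading timestamp from each statement and join into one string."""
    return " ".join(TIMESTAMP_RE.sub("", line).strip() for line in lines)


def train_service_model(log_files):
    """Fit a vectorizer and a model on the successful log files of one service.

    log_files: list of log files, each given as a list of log statements.
    """
    contents = [log_file_content(lines) for lines in log_files]
    vectorizer = TfidfVectorizer()
    features = vectorizer.fit_transform(contents)  # one row per log file
    model = IsolationForest(random_state=0)
    model.fit(features)
    # The fitted vectorizer is returned too: it is needed again at prediction time.
    return vectorizer, model
```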
To find anomalous log files in a failed tar, we analyse it the following way:
- Extract and group the log files (see steps 1 and 2 above)
For every file:
2.1. Preprocess it the same way the successful log files were preprocessed for training (timestamp removal, concatenation, TF-IDF transform, etc.), producing a dataframe
2.2. Predict whether this log file is anomalous by passing the dataframe to the “predict” function of the model for the corresponding service
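A rough sketch of the prediction step, with a small `fit_service` helper included only so the example is self-contained. The names, the `nu`/`gamma` defaults and the shape of the input dictionaries are assumptions for illustration. The important detail is that the new file goes through `transform` of the vectorizer fitted during training (not `fit_transform`), so it gets the same feature columns; `predict` returns -1 for outliers and +1 for inliers:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import OneClassSVM


def fit_service(contents, nu=0.1, gamma="scale"):
    """Fit a vectorizer and a one-class SVM on the successful log file contents."""
    vectorizer = TfidfVectorizer()
    model = OneClassSVM(nu=nu, gamma=gamma)
    model.fit(vectorizer.fit_transform(contents))
    return vectorizer, model


def is_anomalous(vectorizer, model, content):
    """Return True if the model flags this log file content as an outlier."""
    features = vectorizer.transform([content])  # transform only, no refitting
    return bool(model.predict(features)[0] == -1)


def anomalous_files(models, failed_tar_files):
    """List the log files of a failed tar that their service model flags.

    models: {service: (vectorizer, model)}
    failed_tar_files: {service: {file_name: preprocessed log file content}}
    """
    flagged = []
    for service, files in failed_tar_files.items():
        if service not in models:
            # Assumption: services never seen in successful builds are skipped
            # here; they may deserve separate attention.
            continue
        vectorizer, model = models[service]
        for name, content in files.items():
            if is_anomalous(vectorizer, model, content):
                flagged.append(name)
    return flagged
```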
What do you think?
- Is this a good approach?
- I think OneClassSVM would be better suited since it can be used for semi-supervised learning (ideal for this case), instead of unsupervised learning. But I’m getting too many false positives. Any tips on choosing the right values for the nu and gamma parameters?
- Other suggestions?