To be effective, Medusa has to be kept up to date with all the changes that happen in Scavenger. This includes new files arriving, and those files being classified as malware or benign, or being removed from the sets that are relevant for classification. A durable message queue in RabbitMQ is used to ensure that no updates are lost even during Medusa server updates and maintenance. Additionally, the consistency between Medusa and Scavenger is continuously checked with complete dumps of object identifiers.
Real-time classification is required whenever a file is too new and a machine is having trouble deciding if the file is malicious. Given that there are 400 million Avast users, there is potential for a massive overload when new threats appear. To prevent this, we implemented a caching proxy between the clients and the Medusa cluster. Because a file's classification can change as new information arrives from Scavenger, the TTL of the cached decisions is set to a few minutes. Despite this short TTL, the cache serves almost 40% of the requests.
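The caching idea can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the class and function names, the use of a string key (for example a hash of the feature vector), and the 180-second TTL are not taken from the actual proxy.

```python
import time
from typing import Callable, Dict, Optional, Tuple


class TTLCache:
    """Keep verdicts for a short time so repeated queries skip the cluster."""

    def __init__(self, ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock  # injectable so the expiry logic can be tested
        self._store: Dict[str, Tuple[float, str]] = {}
        self.hits = 0
        self.misses = 0

    def get(self, key: str) -> Optional[str]:
        entry = self._store.get(key)
        if entry is not None and self.clock() - entry[0] < self.ttl:
            self.hits += 1
            return entry[1]
        self._store.pop(key, None)  # drop an expired entry, if any
        self.misses += 1
        return None

    def put(self, key: str, verdict: str) -> None:
        self._store[key] = (self.clock(), verdict)


def classify_via_proxy(cache: TTLCache, key: str,
                       backend: Callable[[str], str]) -> str:
    """Answer from the cache when possible, otherwise ask the backend."""
    verdict = cache.get(key)
    if verdict is None:
        verdict = backend(key)  # hypothetical call into the classifier cluster
        cache.put(key, verdict)
    return verdict
```

A short TTL trades some hit rate for freshness: a verdict that changes on the backend is picked up by clients once the cached entry expires.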
Every Medusa node in our deployment uses two or four Nvidia GPUs. A Medusa cluster has one master node that is aware of all the sets, and several slave nodes, each of which holds some parts of the sets. The classification needs many clean and malicious samples. The Evo-gen generator also uses a set with unclassified samples. Because of a large difference in usage patterns, we keep the different sets separate. Clean set samples are the most important because of the inherently high costs of a false positive. Thus, the clean set takes most of the space and is proportionally the slowest to scan. To increase the throughput, we keep the clean set mirrored. The sets with recent malware and unclassified samples take up only a fraction of the space, roughly 10%, needed for a clean set.
Each sample is represented by a constant-sized feature vector consisting of approximately 100 attributes. We keep the exact composition of the feature vector secret, but, for example, obvious candidates such as section table data in the Portable Executable format are included. In general, there are static and dynamic features, categorized as offsets, sizes, checksums, factors, bit flags and generic numbers. Taking into account the nature of the attributes, we ended up with several distance operators and a weighting scheme that equalizes the importance of the attributes. The following table contains a sample of the operators we use.
| Distance operator | Field types | Description |
|---|---|---|
| EQUAL_RET32 | Checksum, value | return 32 when values are equal |
| Ham-Mul32 | Bitfield | Hamming distance multiplied by 32; each flag change is as important as the maximal change of one feature |
| Log | Length, Offset | base 2 logarithm of a difference |
| Order | Length, Offset, Value | difference of base 2 logarithms |
| Retz | all | ignore the feature, return 0 for all values |
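To make the operator idea concrete, here is a minimal sketch of four of the operators from the table combined into a weighted distance. The function names, the `+1` offsets inside the logarithms (added so that equal or zero values are well defined), the assumption of non-negative integer features, and the weights are all illustrative and not taken from Medusa. EQUAL_RET32 is left out because the table does not say what it returns for unequal values.

```python
import math
from typing import Callable, Sequence


def ham_mul32(a: int, b: int) -> float:
    """Hamming distance between two bit fields, multiplied by 32."""
    return 32.0 * bin(a ^ b).count("1")


def log_diff(a: int, b: int) -> float:
    """Base 2 logarithm of the difference (the +1 is an assumption)."""
    return math.log2(1 + abs(a - b))


def order(a: int, b: int) -> float:
    """Difference of base 2 logarithms (the +1 is an assumption)."""
    return abs(math.log2(1 + a) - math.log2(1 + b))


def retz(a: int, b: int) -> float:
    """Ignore the feature: always 0."""
    return 0.0


def distance(x: Sequence[int], y: Sequence[int],
             operators: Sequence[Callable[[int, int], float]],
             weights: Sequence[float]) -> float:
    """Weighted sum of per-feature distances, one operator per feature."""
    return sum(w * op(a, b)
               for a, b, op, w in zip(x, y, operators, weights))
```

For example, `distance((0b1010, 8, 7, 123), (0b0110, 1, 0, 456), [ham_mul32, log_diff, order, retz], [1, 1, 1, 1])` adds 64 for the two flipped flags, 3 for each logarithmic feature, and 0 for the ignored one.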
The most common approach for instance-based learning is nearest neighbor classification. To fine-tune our classifier, we built a tool, called Pythia, which displays the nearest neighbors of a given query sample. It uses a dimensionality reduction method (NMDS) to display the neighbors in 2D space, and also displays additional metadata for the selected samples. This information can be used by a human to determine whether or not it is feasible to distinguish between malware and clean neighbors in the current case. The goal was to create a fully autonomous system, which means high precision at the cost of lower recall. After some experimenting, we added a few thresholds, including minimal allowed distance to clean files, maximal allowed distance to malware files, as well as a weighting term that shifts the balance between clean and malware sets.
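One way such thresholds could fit together is sketched below. This is an assumed decision rule, not the production one: the parameter names, the way the weighting term `clean_bias` is applied, and the "undecided" fallback are all illustrative choices. It abstains unless every condition favors a malware verdict, which is how precision is bought at the cost of recall.

```python
from typing import Sequence


def decide(dist_to_clean: Sequence[float],
           dist_to_malware: Sequence[float],
           min_clean_dist: float,
           max_malware_dist: float,
           clean_bias: float) -> str:
    """Return "malicious" only when all thresholds agree, else "undecided".

    Both distance lists are assumed to be non-empty.
    """
    nearest_clean = min(dist_to_clean)
    nearest_malware = min(dist_to_malware)

    # Too close to a known clean file: the false-positive risk is too high.
    if nearest_clean < min_clean_dist:
        return "undecided"
    # Too far from any known malware: not enough evidence.
    if nearest_malware > max_malware_dist:
        return "undecided"
    # Assumed weighting: malware must be clean_bias times closer than clean.
    if nearest_malware * clean_bias < nearest_clean:
        return "malicious"
    return "undecided"
```

Raising `clean_bias` or `min_clean_dist` makes the rule more conservative: fewer samples are flagged, but those that are sit far from anything known to be clean.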
The redundancy in real world data is quite significant. Our internal systems handle around 250,000 new PE files every day. Out of those, 150,000 can be directly assigned to one of 20,000 clusters using very strict clustering criteria (low threshold distance and complete linkage). Each cluster can then be classified as a whole. That means 130,000 fewer decisions to make, and that the total number of clusters does not grow by 20,000 every day, as the clusters overlap between days.
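The strict assignment criterion can be sketched as follows. Under complete linkage, a sample may join a cluster only if it is within the threshold of every member, not just the nearest one. The function name, the greedy first-fit order, and the toy one-dimensional data in the note below are assumptions for illustration.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def assign_to_clusters(samples: Sequence[T],
                       clusters: List[List[T]],
                       distance: Callable[[T, T], float],
                       threshold: float) -> List[T]:
    """Add each sample to the first cluster it fits; return the leftovers.

    A sample fits a cluster only if its distance to EVERY member is at most
    the threshold (the complete-linkage criterion).
    """
    unassigned: List[T] = []
    for sample in samples:
        for members in clusters:
            if all(distance(sample, m) <= threshold for m in members):
                members.append(sample)
                break
        else:
            unassigned.append(sample)  # no cluster accepted the sample
    return unassigned
```

With clusters `[[0, 1], [10]]`, a threshold of 1.0, and absolute difference as the distance, the sample 1.8 is rejected even though it is within 1.0 of the member 1, because it is 1.8 away from the member 0. With the daily figures above, classifying 20,000 clusters instead of 150,000 individual files is where the 130,000 saved decisions come from.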
| Method | How many users | Delay in classification | How many file versions |
|---|---|---|---|
| Real-time classification | single user | no delay | one |
Avast checks every executable before it’s executed on a customer’s machine. When no signature from the current threat database matches the file, the FileRep service is queried. If the returned user count (prevalence) is anomalously low, the executable ends up in the Avast Sandbox. If the executable trace log does not match any known threat, the real-time classifier is invoked. Avast extracts the feature vector, submits it to a cloud-based service, and waits for the response. Most of the low-prevalence files are benign. Out of approximately 250,000 requests daily, about 4,000 are classified as malicious.
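The order of these checks can be summarized in a small sketch. All names here are hypothetical: the callables stand in for the signature scan, FileRep lookup, sandbox, and cloud classifier, and the prevalence threshold is a caller-supplied assumption because the text only says "anomalously low". The function reports which stage settled the question rather than what the client does next.

```python
from typing import Callable


def check_executable(sample: str, *,
                     matches_signature: Callable[[str], bool],
                     prevalence: Callable[[str], int],
                     sandbox_matches_threat: Callable[[str], bool],
                     realtime_verdict: Callable[[str], str],
                     low_prevalence_threshold: int) -> str:
    """Walk the stages in the order described and name the deciding stage."""
    if matches_signature(sample):
        return "signature"
    # Files seen by enough users go no further in this sketch.
    if prevalence(sample) >= low_prevalence_threshold:
        return "filerep"
    if sandbox_matches_threat(sample):
        return "sandbox"
    # Last resort: feature vector goes to the cloud-based classifier.
    return "realtime:" + realtime_verdict(sample)
```

Each stage filters out most files before the next, more expensive one runs. Of the requests that reach the last stage, 4,000 out of 250,000 is about 1.6%.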
Once a file is classified as malicious and our internal systems check that it is safe to detect this particular file worldwide, a simple flag is set in the FileRep service. Every Avast client that encounters that particular file instantly blocks it and reports it as FileRepMalware.
Old string-based signatures work well when properly executed, and are especially good at generalizing many variants of a threat. But string-based signatures require an analyst and time. In today’s threat landscape, with all its variants and interconnectivity, there just aren’t enough people or enough time to keep pace. A new approach was required that generalizes like string signatures do, but doesn’t rely on human intervention or take as much time.
Enter Evo-gen. Evo-gen uses the distance function to collect a set of very similar feature vectors. We then pick the features that make those vectors similar and build a rule set from them. This is somewhat similar to rule-set generation in decision trees, but the objectives are different. To boost generalization, we pick as few rules as possible while keeping hits in the clean set at zero. But there are many ways to pick 20 rules from 100 possible ones: about 5.36 × 10^20, or roughly 536 quintillion. We’re currently taming the combinatorial explosion with a stochastic approach, which provides better results than Scavenger approaches. This is where the speed of the GPUs is very important again. While trying to understand how the Evo-gen rule sets (blue) affect the signature “ecosystem,” we produced the following visualization. Each blob represents a different rule set or signature, and the size of the blob is proportional to the number of detected variants.
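The size of the search space and the idea of a stochastic search can be illustrated with a short sketch. This is not the Evo-gen algorithm: it assumes, purely for illustration, that a rule is "feature i equals the value shared by the whole cluster", and it uses plain random sampling of rule subsets. All function names are hypothetical.

```python
import math
import random
from typing import List, Optional, Sequence, Tuple

Vector = Tuple[int, ...]

# Number of ways to pick 20 rules out of 100: about 5.36e20.
WAYS_20_OF_100 = math.comb(100, 20)


def shared_features(cluster: Sequence[Vector]) -> List[int]:
    """Indices where every vector in the cluster has the same value."""
    first = cluster[0]
    return [i for i in range(len(first))
            if all(v[i] == first[i] for v in cluster)]


def clean_hits(rule_idx: Sequence[int], reference: Vector,
               clean_set: Sequence[Vector]) -> int:
    """Count clean vectors that satisfy every equality rule."""
    return sum(all(c[i] == reference[i] for i in rule_idx)
               for c in clean_set)


def smallest_rule_set(cluster: Sequence[Vector],
                      clean_set: Sequence[Vector],
                      tries: int = 200,
                      seed: int = 0) -> Optional[List[int]]:
    """Randomly sample rule subsets, smallest sizes first, until one has
    zero hits in the clean set. Returns None if the search finds nothing."""
    rng = random.Random(seed)
    candidates = shared_features(cluster)
    for size in range(1, len(candidates) + 1):
        for _ in range(tries):
            picked = rng.sample(candidates, size)
            if clean_hits(picked, cluster[0], clean_set) == 0:
                return sorted(picked)
    return None
```

Trying small subsets first mirrors the stated objective: fewer rules generalize to more variants, and the zero-hit check on the clean set guards against false positives. Random sampling is only one possible stochastic strategy; at realistic sizes, each candidate evaluation is a scan of the clean set, which is where GPU throughput matters.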