Malware authors use many techniques to hide their creations from antivirus tools and from static and dynamic analyzers. Antivirus vendors are no slouches either: they use advanced hashing algorithms to search for "related" samples. Today we will explain how these algorithms work, with details and illustrative examples.
In practice, a new piece of malware is usually built by reusing the base or core of an existing one. Virus writers rarely bother with the time-consuming development of new "quality" viruses; they simply rework existing samples.
The reused code can be rebuilt with a different compiler; functions can be removed from it or, conversely, added to it. Some libraries are updated, and the layout of the code within the file changes (new linkers, packers, obfuscation, and so on). The point of these transformations is to give a well-known malicious program a new look, so that the modified version remains undetected for some time. However, there are ways to detect this kind of repackaging and modification.
These detection techniques are often used to analyze large data sets and find common elements in them. Practical examples include globally distributed knowledge bases such as VirusTotal, the databases of antivirus companies, and Threat Intelligence approaches.
Hash is "clear"
Hans Peter Luhn of IBM developed systems for information analysis in the 1940s, including research into the storage, transmission and retrieval of text data. This led him to create transformation algorithms, and then to hashing as a way to look up phone numbers and text. This is how indexing and the concept of divide and conquer took their first steps in computing.
Now there are many hashing algorithms that differ in cryptographic strength, computation speed, bit width and other characteristics.
We usually think of hash functions as cryptographic hash functions. They are a common tool used for a variety of tasks, such as:
- electronic signature;
- detection of malware (both files and their markers of compromise).
In today's article, we'll talk about how various hashing algorithms help us fight malware.
What is a hash
A cryptographic hash function, often referred to simply as a hash, is a mathematical transformation that maps an input of arbitrary length to a fixed-length string of letters and digits. A hash is considered cryptographically secure if the following holds:
- the original data cannot feasibly be recovered from the hash;
- it is collision resistant, that is, it is computationally infeasible to find two different inputs that produce the same hash.
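The "fixed-length" part of this definition is easy to check with Python's standard hashlib module. The snippet below is only an illustrative sketch: the inputs are made up, and the helper name hex_lengths is hypothetical.

```python
import hashlib

# Made-up inputs of very different sizes (not real samples).
inputs = [b"", b"a", b"x" * 1_000_000]


def hex_lengths(algorithm: str) -> list:
    """Return the length of the hex digest for each sample input."""
    return [len(hashlib.new(algorithm, data).hexdigest()) for data in inputs]


# Each algorithm yields digests of one constant length,
# regardless of how long the input is.
for name in ("md5", "sha1", "sha256"):
    print(name, hex_lengths(name))
```

Whatever the input size, the digest length depends only on the algorithm: 32 hex characters for MD5, 40 for SHA-1 and 64 for SHA-256.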
MD5, SHA-1 and SHA-256 are the hash algorithms most often used in malware detection (MD5 and SHA-1 are no longer considered collision resistant, but they remain common as file identifiers). Until recently, malware was identified only by the signature (hash) of the executable file.
But today it is not enough to know just the hash of an object, since it is a weak indicator of compromise (IoC). An IoC is any artifact by which malware can be identified: for example, the registry keys it uses, the libraries it loads, IP addresses, byte sequences, software versions, date and time triggers, ports involved, URLs.
Consider the Pyramid of Pain proposed by cybersecurity analyst David Bianco. It ranks the types of indicators of compromise by how much pain it causes attackers when defenders learn to detect them. For example, if you know the MD5 hash of a malicious file, the file can be detected quite easily and accurately on a system. However, this causes the attacker very little pain: it is enough to change a single bit of the malware file, and the hash will change. Thus the virus can mutate endlessly, and each new copy will have a hash different from all the others.
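The effect of a one-bit change is easy to reproduce. The sketch below uses Python's standard hashlib; the "sample" is just a stand-in byte string rather than a real executable, and the helper names flip_bit and hamming_bits are hypothetical.

```python
import hashlib


def flip_bit(data: bytes, bit_index: int) -> bytes:
    """Return a copy of data with exactly one bit inverted."""
    buf = bytearray(data)
    buf[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(buf)


def hamming_bits(a: bytes, b: bytes) -> int:
    """Count differing bits between two equal-length byte strings."""
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))


# Stand-in bytes for a "sample"; this is not a real executable.
sample = b"MZ" + bytes(range(256)) * 4
variant = flip_bit(sample, 1000)

old = hashlib.md5(sample).digest()
new = hashlib.md5(variant).digest()

print("input bits changed: ", hamming_bits(sample, variant))
print("digest bits changed:", hamming_bits(old, new), "of", len(old) * 8)
print(old.hex())
print(new.hex())
```

The two inputs differ in one bit, yet the digests have nothing visibly in common, so a blocklist keyed on the old MD5 no longer matches the variant.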
If you deal with many malicious samples, it becomes clear that most of them are not truly unique. Attackers often borrow or buy source code from each other and use it in their own programs. Very often, after the source code of some malware appears on the Internet, numerous derivatives assembled from the available fragments pop up.
How can you determine the similarity between different samples of the same malware family?
To find such similarities, there are special hashing algorithms, for example fuzzy hashing and import hashing (imphash, a hash of the imported libraries). The two approaches use different detection methods to find recurring fragments of malware belonging to specific families. Let's take a closer look at these two methods.
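To make the imphash idea more concrete, here is a simplified sketch. It assumes the commonly described scheme: normalize each imported "library.function" pair, keep the pairs in import-table order, join them with commas and take the MD5 of the result. The function toy_imphash and the import lists are hypothetical; the sketch does not parse PE files and ignores imports by ordinal, which real implementations have to handle.

```python
import hashlib


def toy_imphash(imports) -> str:
    """MD5 over the ordered, normalized list of 'library.function' names."""
    parts = []
    for library, function in imports:
        lib = library.lower()
        # Drop common module extensions so 'KERNEL32.dll' and 'kernel32' match.
        for ext in (".dll", ".ocx", ".sys"):
            if lib.endswith(ext):
                lib = lib[: -len(ext)]
                break
        parts.append(f"{lib}.{function.lower()}")
    return hashlib.md5(",".join(parts).encode()).hexdigest()


# Hypothetical import tables of three samples.
sample_a = [("KERNEL32.dll", "CreateFileA"),
            ("KERNEL32.dll", "WriteFile"),
            ("WS2_32.dll", "connect")]
sample_b = [("kernel32.DLL", "createfilea"),   # same imports, different casing
            ("kernel32.DLL", "writefile"),
            ("ws2_32.dll", "CONNECT")]
sample_c = list(reversed(sample_a))            # same imports, different order

print(toy_imphash(sample_a))
print(toy_imphash(sample_b))
print(toy_imphash(sample_c))
```

In this sketch two files that import the same functions in the same order get the same value no matter how the rest of the file was changed, while reordering the imports produces a different value.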
"Fuzzy" hash – SSDeep
In a cryptographic hash function, the slightest change in the input (even a single bit) changes the hash significantly. In a fuzzy hash, by contrast, the result changes only slightly or not at all. Fuzzy hashes are therefore more resistant to small changes in a file, which makes them much more effective at detecting new modifications of malware without requiring large computational resources.
Fuzzy hashing is a technique in which a program such as SSDeep computes piecewise hashes over the input data. This method is known as context triggered piecewise hashing (CTPH, also called fuzzy hashing).
In fact, there are quite a few classifications of fuzzy hashes. For example, by mechanism of operation the algorithms are divided into piecewise hashing, context triggered piecewise hashing, statistically improbable features and block-based rebuilding. By the type of information processed, they can be divided into byte-level, syntactic and semantic. But when people talk about fuzzy hashes, they usually mean CTPH.
The SSDeep algorithm was developed by Jesse Kornblum for use in computer forensics and is based on the spamsum algorithm. SSDeep computes several traditional fixed-size hashes over individual segments of a file (simple non-cryptographic hashes rather than cryptographic ones), which makes it possible to detect similar objects. The SSDeep algorithm uses a rolling hash over a sliding window… It can also be called recursive piecewise hashing.
CTPH-like hashes are often at the heart of locality-sensitive hashing (LSH) algorithms. Their tasks include approximate nearest neighbor (ANN) search, or, more simply, finding similar objects, only at a slightly higher level of abstraction. LSH algorithms are used not only in the fight against malware, but also in multimedia, duplicate search, the search for similar genes in biology and many other areas.
How does SSDeep work? At first glance, everything is pretty simple:
- it splits the file into smaller pieces and examines them, not the whole file;
- it can identify file fragments that contain the same byte sequences in a similar order, even if the bytes between those sequences differ in value and length.
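The two points above can be sketched as a toy context triggered piecewise hash. This is not the SSDeep implementation: the window size, the fixed BLOCK_SIZE, the plain byte-sum rolling hash, the FNV-1a piece hash and the difflib-based similarity score are all simplifying assumptions chosen for readability, and every name in the block is hypothetical.

```python
import difflib
import random

WINDOW = 7        # illustrative sliding-window size
BLOCK_SIZE = 64   # assumed fixed trigger modulus (real tools adapt it to file size)
B64 = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"


def fnv1a(chunk: bytes) -> int:
    """32-bit FNV-1a, a simple non-cryptographic hash for one piece."""
    h = 0x811C9DC5
    for byte in chunk:
        h = ((h ^ byte) * 0x01000193) & 0xFFFFFFFF
    return h


def split_pieces(data: bytes) -> list:
    """Cut data wherever the rolling sum of the last WINDOW bytes triggers."""
    pieces, start, rolling = [], 0, 0
    for i, byte in enumerate(data):
        rolling += byte
        if i >= WINDOW:
            rolling -= data[i - WINDOW]   # slide the window forward
        if rolling % BLOCK_SIZE == BLOCK_SIZE - 1:
            pieces.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        pieces.append(data[start:])       # trailing piece without a trigger
    return pieces


def toy_signature(data: bytes) -> str:
    """One base64 character per piece, taken from the piece hash."""
    return "".join(B64[fnv1a(piece) % 64] for piece in split_pieces(data))


def similarity(sig_a: str, sig_b: str) -> float:
    """Rough 0..1 score; a stand-in for an edit-distance based comparison."""
    return difflib.SequenceMatcher(None, sig_a, sig_b).ratio()


if __name__ == "__main__":
    rng = random.Random(1)
    original = bytes(rng.getrandbits(8) for _ in range(4096))
    # Simulate a small modification: a few bytes inserted in the middle.
    modified = original[:2000] + b"PATCH" + original[2000:]
    sig_a, sig_b = toy_signature(original), toy_signature(modified)
    print(sig_a)
    print(sig_b)
    print("similarity:", round(similarity(sig_a, sig_b), 2))
```

Because the cut points depend only on the last few bytes, a local edit disturbs only the pieces around it. The boundaries after the edit line up again, so most characters of the two signatures still match.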
VirusTotal runs SSDeep fuzzy hashing on the files that users upload. Example – by link…