Machine Learning for Cybersecurity Cookbook

上QQ阅读APP看书，第一时间看更新

How to do it...

In the following steps, we will collect notable portions of the PE header:

Import pefile and modules for enumerating our samples:

import pefile
from os import listdir
from os.path import isfile, join

directories = ["Benign PE Samples", "Malicious PE Samples"]

We define a function to collect the names of the sections of a file and preprocess them for readability and normalization:

def get_section_names(pe):
    """Gets a list of section names from a PE file."""
    list_of_section_names = []
    for sec in pe.sections:
        normalized_name = sec.Name.decode().replace("\x00", "").lower()
        list_of_section_names.append(normalized_name)
    return list_of_section_names

We define a convenience function to preprocess and standardize our imports:

def preprocess_imports(list_of_DLLs):
    """Normalize the naming of the imports of a PE file."""
    return [x.decode().split(".")[0].lower() for x in list_of_DLLs]

We then define a function to collect the imports from a file using pefile:

def get_imports(pe):
    """Get a list of the imports of a PE file."""
    list_of_imports = []
    for entry in pe.DIRECTORY_ENTRY_IMPORT:
        list_of_imports.append(entry.dll)
    return preprocess_imports(list_of_imports)

Finally, we prepare to iterate through all of our files and create lists to store our features:

imports_corpus = []
num_sections = []
section_names = []
for dataset_path in directories:
    samples = [f for f in listdir(dataset_path) if isfile(join(dataset_path, f))]
    for file in samples:
        file_path = dataset_path + "/" + file
        try:

In addition to collecting the preceding features, we also collect the number of sections of a file:

            pe = pefile.PE(file_path)
            imports = get_imports(pe)
            n_sections = len(pe.sections)
            sec_names = get_section_names(pe)
            imports_corpus.append(imports)
            num_sections.append(n_sections)
            section_names.append(sec_names)

In case a file's PE header cannot be parsed, we define a try-catch clause:

        except Exception as e:
            print(e)
            print("Unable to obtain imports from " + file_path)