Skip to content

Week 12 - Files & Databases

File Manipulation and Persistent Storage in Python

Objectives

By the end of this lesson, students will: * Understand how to work with file names and paths using the os and pathlib modules. * Master string formatting using f-strings. * Read and write YAML files using the yaml module. * Use the shelve module to create and manipulate persistent key-value stores. * Perform file equivalency checks using hash functions. * Walk through directories and perform operations on files.

File Names and Paths

  • Absolute Path: An absolute path is the whole path to a file or directory from the root of the file system (Think C: on Windows or / on Nix based systems). It specifies the exact location of the file or directory, starting from the root directory. Absolute paths are unambiguous and always point to the same location, regardless of the current working directory.

    • Example: /home/viable/documents/a_file.txt or C:\Users\viable\Documents\file.txt
  • Relative Path: A relative path specifies the location of a file or directory relative to the current working directory. It is NOT fixed and can change based on where your program is executed. Relative paths are useful when you want to make file navigation more flexible within your project directory.

    • Example: ./documents/file.txt or ../file.txt.
  • Note in the code below: import os and pathlib, os.path.join, overloaded / path operator, path.resolve()

     1
     2
     3
     4
     5
     6
     7
     8
     9
    10
    11
    12
    import os
    from pathlib import Path
    
    # Using os.path to join and get absolute paths
    file_name = 'example.txt'
    dir_path = '/home/user/documents'
    full_path = os.path.join(dir_path, file_name)
    print("Full path using os.path:", os.path.abspath(full_path))
    
    # Using pathlib for path manipulations
    path = Path('/home/user') / 'documents' / file_name
    print("Full path using pathlib:", path.resolve())
    

  • Note 1: The resolve method returns an absolute path for the path object after resolving any symbolic links and or relative path entries.
  • Note 2: / is overloaded by overloading the __truediv__(self, scalar): method in pathlib’s Path class.

String Formatting with f-strings (A Review)

  • As you know by now, F-strings are a concise way to format strings using expressions inside curly braces {} preceded by an f or F.
  • They are more convenient because they are easier to read and write compared to older formatting methods like printf used in C based languages.
  • F-Strings allow inline variable or expression evaluation directly in the string, which makes the output more human readable.

  • Note in the code below: The f, the {file_name}, the {age + 42} expression.

    1
    2
    3
    4
    5
    6
    7
    name = "Alice"
    age = 30
    print(f"Hello, {name}! You are {age + 42} years old.")
    
    file_path = "/home/user/documents"
    file_name = "data.txt"
    print(f"The file {file_name} is located in {file_path}.")
    

Reading and Writing YAML Files

  • YAML What Is It?: Stands for Yet Another Markup Language
  • It is a human-readable way for presenting, structuring, and storing data in a file or data-store.

    • In the output.yaml file below, we can see how this code is formatted.
  • Note in the code below: import yaml module, load vs safe_load, dump vs safe_dump.

     1
     2
     3
     4
     5
     6
     7
     8
     9
    10
    11
    import yaml
    
    # Writing YAML data to a file
    data = {'name': 'John', 'age': 25, 'skills': ['Python', 'Machine Learning']}
    with open('output.yaml', 'w') as file:
        yaml.dump(data, file)
    
    # Reading YAML data from a file
    with open('output.yaml', 'r') as file:
        loaded_data = yaml.safe_load(file)
    print("Loaded YAML data:", loaded_data)
    

  • In the yaml module, load can parse any YAML, including potentially unsafe types like Python objects, while safe_load only parses standard YAML types to avoid security risks.
  • Similarly, dump can serialize Python objects into YAML with more flexibility, while safe_dump restricts serialization to standard YAML types to ensure safety.
     1
     2
     3
     4
     5
     6
     7
     8
     9
    10
    11
    12
    13
    14
    15
    16
    17
    18
    import yaml
    import os
    
    # Define a custom constructor that could execute system commands
    def run_constructor(loader, node):
        command = loader.construct_scalar(node)
        os.system(command)  # This is the dangerous part, but it does happen!
    
    # Register the custom constructor with a YAML tag
    yaml.add_constructor('!run', run_constructor)
    
    # Arbitrary YAML now allowed to run system commands!
    # This could come from a YAML file, not just this string.
    malicious_yaml = """
    run: !run "echo 'This is malicious code execution!'"
    """
    
    yaml.load(malicious_yaml, Loader=yaml.FullLoader)
    

Using Shelve for Persistent Storage

  • Shelve is a simple way to store Python objects persistently
  • It is Key/Value based just like a Python dictionary
  • Useful for small/medium sized data that doesn’t requrie a database.
  • Common uses:
    • Storing settings, preferences, and application state
    • Caching common data to reduce computational overhead.
    • Persisting user data for small data projects.
  • Example:
     1
     2
     3
     4
     5
     6
     7
     8
     9
    10
    11
    12
    13
    import shelve
    
    # Writing to a shelf
    with shelve.open('data_store') as db:
        db['username'] = 'john_doe'
        db['is_active'] = True
        db['last_login'] = '2024-11-08'
    
    # Reading from a shelf
    with shelve.open('data_store') as db:
        print("Username:", db['username'])
        print("Is active:", db['is_active'])
        print("Last login:", db['last_login'])
    

Checking File Equivalency Using Hash Functions

  • We can use MD5 hashes to compare the contents of 2 files for equality. HOW?
  • We use the hashlib md5 algorithm to hash each files contents. WHY this works?
  • MD5 accidental collisions are extremely rare due to the vast number of possible hash values it can produce.
    • MD5 generates a 128-bit hash, which means it can produce 2^128 different hash values.
    • This is an incredibly large number, and the probability of two random inputs producing the same hash value is incredibly small.  
    • To illustrate this, consider the following analogy:
      • Imagine you have a vast number of pigeonholes (representing the possible hash values) and a much smaller number of pigeons (representing the inputs). The probability of two pigeons ending up in the same pigeonhole is very small.
  • Code Items to Note: import hashlib, def md5_digest, mode rb.
     1
     2
     3
     4
     5
     6
     7
     8
     9
    10
    11
    12
    13
    import os
    
    def find_txt_files(directory, depth=5):
      for root, dirs, files in os.walk(directory):
        if depth == 0:
          break
        for file in files:
          if file.endswith('.txt'):
            print(os.path.join(root, file))
        depth -= 1
    
    root_dir = "/path/to/your/root/directory"
    find_txt_files(root_dir)
    

Walking Through Directories

  • Let’s just look at the code, as it is pretty self explanatory!!!

  • The Recursive way

     1
     2
     3
     4
     5
     6
     7
     8
     9
    10
    11
    12
    13
    14
    15
    16
    #!/usr/bin/env python3
    import os
    
    def recursive_dir_walk(directory, max_depth=5):
        if max_depth == 0:
            return
    
        for entry in os.listdir(directory):
            full_path = os.path.join(directory, entry)
            if os.path.isdir(full_path):
                recursive_dir_walk(full_path, max_depth - 1)
            elif entry.endswith(".txt"):
                print(full_path)
    
    root_directory = "/Users/trevorhartman/CR/thartmanoftheredwoods.mkdocs"
    recursive_dir_walk(root_directory)
    

  • The os.walk way

     1
     2
     3
     4
     5
     6
     7
     8
     9
    10
    11
    12
    13
    14
    15
    16
    17
    #!/usr/bin/env python3
    
    import os
    
    def non_recursive_dir_walk(directory, max_depth=4):
        for root, dirs, files in os.walk(directory, topdown=True):
            depth = root.count(os.sep) - directory.count(os.sep)
            if depth >= max_depth:
                # Prune directories at the max depth
                del dirs[:]
    
            for file in files:
                if file.endswith(".txt"):
                    print(os.path.join(root, file))
    
    root_directory = "/Users/trevorhartman/CR/thartmanoftheredwoods.mkdocs"
    non_recursive_dir_walk(root_directory)
    


Exercises

  1. Replace Word in File: Write a function replace_in_file that takes a target word, a replacement word, and two file paths as arguments. It should read the contents of the first file, replace occurrences of the target word, and write the modified contents to the second file.

    • Hint: Use open() and with statements for file operations.
    • Example:
      1
      replace_in_file('oldword', 'newword', 'source.txt', 'destination.txt')
      
  2. YAML Configuration Loader: Write a script that reads a YAML configuration file and prints out a formatted message for each configuration setting. If the setting is a dictionary, print its keys and values.

    • Example:
      1
      2
      # Config example: {'database': {'host': 'localhost', 'port': 3306}}
      print_config('config.yaml')
      
  3. Directory Image Finder: Create a function find_images_in_directory that takes a directory path and a list of image extensions (e.g., ['.png', '.jpg']). Use os.walk to find and list all image files in the directory and its subdirectories.

    • Example:
      1
      find_images_in_directory('/path/to/images', ['.png', '.jpg'])