Week 12 - Files & Databases
File Manipulation and Persistent Storage in Python
Objectives
By the end of this lesson, students will:
* Understand how to work with file names and paths using the `os` and `pathlib` modules.
* Master string formatting using f-strings.
* Read and write YAML files using the `yaml` module.
* Use the `shelve` module to create and manipulate persistent key-value stores.
* Perform file equivalency checks using hash functions.
* Walk through directories and perform operations on files.
File Names and Paths
- Absolute Path: An absolute path is the full path to a file or directory from the root of the file system (for example, `C:\` on Windows or `/` on Unix-like systems). Because it starts at the root, it is unambiguous and always points to the same location, regardless of the current working directory.
    - Example: `/home/viable/documents/a_file.txt` or `C:\Users\viable\Documents\file.txt`
- Relative Path: A relative path specifies the location of a file or directory relative to the current working directory. It is NOT fixed: the location it refers to changes depending on where your program is executed. Relative paths are useful when you want file navigation to stay flexible within your project directory.
    - Example: `./documents/file.txt` or `../file.txt`
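The difference between the two kinds of path can be sketched in a few lines. This is a minimal illustration, not part of the lesson's original code: it uses `PurePosixPath` so the results are the same on any operating system, `locate` is a hypothetical helper, and the working directories passed to it are assumed values.

```python
from pathlib import PurePosixPath

absolute = PurePosixPath("/home/viable/documents/a_file.txt")
relative = PurePosixPath("documents/file.txt")

print(absolute.is_absolute())  # True
print(relative.is_absolute())  # False


def locate(relative_path, working_dir):
    """Return where relative_path points when the program runs in working_dir."""
    return PurePosixPath(working_dir) / relative_path


# The same relative path lands in different places (assumed working directories).
print(locate(relative, "/home/viable"))  # /home/viable/documents/file.txt
print(locate(relative, "/tmp"))          # /tmp/documents/file.txt
```

The absolute path never needs a working directory to be interpreted, which is why it is unambiguous.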
- Note in the code below: `import os` and `pathlib`, `os.path.join`, the overloaded `/` path operator, and `path.resolve()`.

```python
import os
from pathlib import Path

# Using os.path to join and get absolute paths
file_name = 'example.txt'
dir_path = '/home/user/documents'
full_path = os.path.join(dir_path, file_name)
print("Full path using os.path:", os.path.abspath(full_path))

# Using pathlib for path manipulations
path = Path('/home/user') / 'documents' / file_name
print("Full path using pathlib:", path.resolve())
```
- Note 1: The `resolve()` method returns an absolute path for the path object after resolving any symbolic links and relative path entries such as `.` and `..`.
- Note 2: The `/` operator works on paths because pathlib's `Path` class implements the `__truediv__` special method.
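To see how overloading `/` works in general, here is a small sketch with a toy class. `Route` is hypothetical and is not how pathlib is actually implemented; it only shows that Python calls `__truediv__` whenever an object appears on the left of `/`.

```python
class Route:
    """Toy stand-in for a path type (hypothetical, not part of pathlib)."""

    def __init__(self, *parts):
        self.parts = tuple(parts)

    def __truediv__(self, other):
        # Called for `route / other`; returns a new Route with one more part.
        return Route(*self.parts, str(other))

    def __str__(self):
        return "/" + "/".join(self.parts)


route = Route("home", "user") / "documents" / "example.txt"
print(route)  # /home/user/documents/example.txt
```

Because `__truediv__` returns a new `Route`, the operator can be chained, just as with `Path('/home/user') / 'documents' / file_name`.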
String Formatting with f-strings (A Review)
- As you know by now, f-strings are a concise way to format strings: the string literal is prefixed with `f` or `F`, and expressions inside curly braces `{}` are evaluated and inserted into the result.
- They are easier to read and write than older formatting methods, such as the printf-style formatting used in C-based languages.
- F-strings allow inline variable or expression evaluation directly in the string, which keeps the formatting code human readable.
- Note in the code below: the `f` prefix, the `{file_name}` placeholder, and the `{age + 42}` expression.

```python
name = "Alice"
age = 30
print(f"Hello, {name}! You are {age + 42} years old.")

file_path = "/home/user/documents"
file_name = "data.txt"
print(f"The file {file_name} is located in {file_path}.")
```
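For comparison with the older methods mentioned above, the sketch below builds the same greeting three ways: with printf-style `%` formatting, with `str.format`, and with an f-string. The variable names `percent_style`, `format_style`, and `f_string` are made up for this illustration.

```python
name = "Alice"
age = 30

# Older styles: values are listed separately from the placeholders.
percent_style = "Hello, %s! You are %d years old." % (name, age + 42)
format_style = "Hello, {}! You are {} years old.".format(name, age + 42)

# f-string: each expression sits exactly where its value will appear.
f_string = f"Hello, {name}! You are {age + 42} years old."

print(f_string)  # Hello, Alice! You are 72 years old.
print(percent_style == format_style == f_string)  # True
```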
Reading and Writing YAML Files
- YAML, what is it? The name originally stood for "Yet Another Markup Language" and is now officially the recursive acronym "YAML Ain't Markup Language".
- It is a human-readable format for presenting, structuring, and storing data in a file or data store.
- The `output.yaml` file written by the code below shows how this data is formatted.
- Note in the code below: the `import yaml` statement, `load` vs `safe_load`, and `dump` vs `safe_dump`.

```python
import yaml

# Writing YAML data to a file
data = {'name': 'John', 'age': 25, 'skills': ['Python', 'Machine Learning']}
with open('output.yaml', 'w') as file:
    yaml.dump(data, file)

# Reading YAML data from a file
with open('output.yaml', 'r') as file:
    loaded_data = yaml.safe_load(file)

print("Loaded YAML data:", loaded_data)
```
- In the `yaml` module, `load` can parse any YAML, including potentially unsafe types such as Python objects (depending on the `Loader` it is given), while `safe_load` only parses standard YAML types to avoid security risks.
- Similarly, `dump` can serialize Python objects into YAML with more flexibility, while `safe_dump` restricts serialization to standard YAML types to ensure safety.
- The example below registers a custom constructor to show how loading untrusted YAML can end up running system commands.
```python
import yaml
import os

# Define a custom constructor that could execute system commands
def run_constructor(loader, node):
    command = loader.construct_scalar(node)
    os.system(command)  # This is the dangerous part, but it does happen!

# Register the custom constructor with a YAML tag
yaml.add_constructor('!run', run_constructor)

# Arbitrary YAML now allowed to run system commands!
# This could come from a YAML file, not just this string.
malicious_yaml = """
run: !run "echo 'This is malicious code execution!'"
"""

yaml.load(malicious_yaml, Loader=yaml.FullLoader)
```
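By contrast, the same document can be handed to `safe_load`. A minimal sketch, assuming PyYAML is installed: the safe loader has no constructor for the custom `!run` tag, so it raises an error instead of running anything. The names `untrusted_yaml` and `rejected` are illustrative only.

```python
import yaml

untrusted_yaml = """
run: !run "echo 'This is malicious code execution!'"
"""

try:
    yaml.safe_load(untrusted_yaml)
    rejected = False
except yaml.YAMLError as exc:
    # The safe loader does not know the !run tag, so nothing is executed.
    rejected = True
    print("safe_load refused the document:", type(exc).__name__)

# Plain data still loads normally.
print(yaml.safe_load("name: John"))  # {'name': 'John'}
```

For YAML that comes from users, downloads, or any other untrusted source, `safe_load` is usually the safer default.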
Using Shelve for Persistent Storage
- Shelve is a simple way to store Python objects persistently.
- It is key/value based, just like a Python dictionary.
- Useful for small to medium-sized data that doesn't require a database.
- Common uses:
    - Storing settings, preferences, and application state.
    - Caching common data to reduce computational overhead.
    - Persisting user data for small data projects.
- Example:
```python
import shelve

# Writing to a shelf
with shelve.open('data_store') as db:
    db['username'] = 'john_doe'
    db['is_active'] = True
    db['last_login'] = '2024-11-08'

# Reading from a shelf
with shelve.open('data_store') as db:
    print("Username:", db['username'])
    print("Is active:", db['is_active'])
    print("Last login:", db['last_login'])
```
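Because a shelf behaves like a dictionary, the usual dictionary operations are available. The sketch below is illustrative only: it works in a temporary directory so it leaves no files behind, and the keys `recent_files` and `theme` are made-up examples. It also shows one common pitfall: with the default settings, a mutable value must be assigned back to its key for the change to be saved.

```python
import os
import shelve
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    store_path = os.path.join(tmp_dir, "data_store")

    with shelve.open(store_path) as db:
        db["recent_files"] = ["a.txt"]
        # Reading a value gives back a copy, so change it and store it again.
        recent = db["recent_files"]
        recent.append("b.txt")
        db["recent_files"] = recent

    with shelve.open(store_path) as db:
        has_username = "username" in db      # membership test, like a dict
        theme = db.get("theme", "light")     # default when the key is missing
        recent_files = db["recent_files"]
        stored_keys = sorted(db.keys())

print(has_username, theme, recent_files, stored_keys)
```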
Checking File Equivalency Using Hash Functions
- We can use MD5 hashes to compare the contents of two files for equality. How?
- We use the `hashlib` MD5 algorithm to hash each file's contents and then compare the digests. Why does this work?
- Accidental MD5 collisions are extremely rare because of the vast number of possible hash values.
- MD5 generates a 128-bit hash, which means it can produce 2^128 different hash values.
- This is an incredibly large number, so the probability of two random inputs producing the same hash value is incredibly small.
- To illustrate this, consider the following analogy:
    - Imagine you have a vast number of pigeonholes (representing the possible hash values) and a much smaller number of pigeons (representing the inputs). The probability of two pigeons ending up in the same pigeonhole is very small.
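The size of that probability can be estimated with the birthday-bound approximation: with n inputs there are n(n-1)/2 pairs, and each pair collides with probability 1/2^128. A rough sketch follows; `collision_probability` is a hypothetical helper, the estimate is only valid while the result is small, and it assumes hash values are uniformly random, which describes accidental collisions rather than deliberately crafted ones.

```python
def collision_probability(num_inputs, bits=128):
    """Approximate chance that any two of num_inputs random inputs share a hash."""
    pairs = num_inputs * (num_inputs - 1) // 2
    return pairs / 2**bits


# Even a billion files leave a vanishingly small chance of an accidental clash.
print(collision_probability(10**9))  # about 1.5e-21
```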
- Code items to note: `import hashlib`, `def md5_digest`, and file mode `rb`.
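A minimal sketch of the items in that note, offered as one possible shape for such a function rather than the lesson's original listing. Only `md5_digest` is named in the text; `files_match`, the `chunk_size` parameter, and the demo file names are assumptions added for illustration.

```python
import hashlib
import os
import tempfile


def md5_digest(file_path, chunk_size=65536):
    """Return the hexadecimal MD5 digest of a file's contents."""
    digest = hashlib.md5()
    # Mode 'rb' reads raw bytes, so text decoding and newline translation
    # cannot change what gets hashed.
    with open(file_path, "rb") as file:
        for chunk in iter(lambda: file.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def files_match(first_path, second_path):
    """Treat two files as equal when their MD5 digests are equal."""
    return md5_digest(first_path) == md5_digest(second_path)


# Demo on throwaway files (file names are made up for this example).
with tempfile.TemporaryDirectory() as tmp_dir:
    paths = {}
    for name, content in [("a.txt", b"same"), ("b.txt", b"same"), ("c.txt", b"different")]:
        paths[name] = os.path.join(tmp_dir, name)
        with open(paths[name], "wb") as file:
            file.write(content)
    same = files_match(paths["a.txt"], paths["b.txt"])
    differ = files_match(paths["a.txt"], paths["c.txt"])

print(same, differ)  # True False
```

Reading in chunks keeps memory use flat for large files. MD5 is fine for spotting accidental differences; where someone might craft colliding files on purpose, `hashlib.sha256` can be swapped in the same way.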
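```python
import os

def find_txt_files(directory, depth=5):
    # Note: `depth` counts how many directories os.walk visits before
    # stopping; it does not measure how deeply they are nested.
    for root, dirs, files in os.walk(directory):
        if depth == 0:
            break
        for file in files:
            if file.endswith('.txt'):
                print(os.path.join(root, file))
        depth -= 1

root_dir = "/path/to/your/root/directory"
find_txt_files(root_dir)
```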
Walking Through Directories
- Let's just look at the code, as it is pretty self-explanatory!
- The recursive way:

```python
#!/usr/bin/env python3
import os

def recursive_dir_walk(directory, max_depth=5):
    if max_depth == 0:
        return

    for entry in os.listdir(directory):
        full_path = os.path.join(directory, entry)
        if os.path.isdir(full_path):
            recursive_dir_walk(full_path, max_depth - 1)
        elif entry.endswith(".txt"):
            print(full_path)

root_directory = "/Users/trevorhartman/CR/thartmanoftheredwoods.mkdocs"
recursive_dir_walk(root_directory)
```
- The `os.walk` way:

```python
#!/usr/bin/env python3
import os

def non_recursive_dir_walk(directory, max_depth=4):
    for root, dirs, files in os.walk(directory, topdown=True):
        depth = root.count(os.sep) - directory.count(os.sep)
        if depth >= max_depth:
            # Prune directories at the max depth
            del dirs[:]

        for file in files:
            if file.endswith(".txt"):
                print(os.path.join(root, file))

root_directory = "/Users/trevorhartman/CR/thartmanoftheredwoods.mkdocs"
non_recursive_dir_walk(root_directory)
```
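Since the lesson also covers `pathlib`, the same search can be sketched with `Path.rglob`. This is an alternative written for illustration, not one of the lesson's original listings: `pathlib_dir_walk` is a hypothetical name, it returns the matches instead of printing them, and the demo builds its own throwaway directory tree.

```python
import tempfile
from pathlib import Path


def pathlib_dir_walk(directory, max_depth=4):
    """Return .txt files at most max_depth directory levels below directory."""
    base = Path(directory)
    matches = []
    # rglob still visits deeper directories; they are filtered out, not pruned.
    for path in base.rglob("*.txt"):
        # Number of directories between base and the file itself.
        depth = len(path.relative_to(base).parts) - 1
        if path.is_file() and depth <= max_depth:
            matches.append(path)
    return sorted(matches)


# Demo on a throwaway tree: top.txt, a/skip.md, a/b/deep.txt
with tempfile.TemporaryDirectory() as tmp_dir:
    root = Path(tmp_dir)
    (root / "a" / "b").mkdir(parents=True)
    (root / "top.txt").write_text("x")
    (root / "a" / "skip.md").write_text("x")
    (root / "a" / "b" / "deep.txt").write_text("x")
    shallow = [p.name for p in pathlib_dir_walk(root, max_depth=0)]
    everything = sorted(p.name for p in pathlib_dir_walk(root))

print(shallow, everything)  # ['top.txt'] ['deep.txt', 'top.txt']
```

Unlike the `os.walk` version, this one cannot skip deep directories early, so on very large trees the pruning approach may be faster.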
Exercises
- Replace Word in File: Write a function `replace_in_file` that takes a target word, a replacement word, and two file paths as arguments. It should read the contents of the first file, replace occurrences of the target word, and write the modified contents to the second file.
    - Hint: Use `open()` and `with` statements for file operations.
    - Example:

```python
replace_in_file('oldword', 'newword', 'source.txt', 'destination.txt')
```

- YAML Configuration Loader: Write a script that reads a YAML configuration file and prints out a formatted message for each configuration setting. If the setting is a dictionary, print its keys and values.
    - Example:

```python
# Config example: {'database': {'host': 'localhost', 'port': 3306}}
print_config('config.yaml')
```

- Directory Image Finder: Create a function `find_images_in_directory` that takes a directory path and a list of image extensions (e.g., `['.png', '.jpg']`). Use `os.walk` to find and list all image files in the directory and its subdirectories.
    - Example:

```python
find_images_in_directory('/path/to/images', ['.png', '.jpg'])
```