Table Of Contents
- Data Persistence Made Simple
- NumPy's Native Formats
- Text-Based Formats for Human Readability
- Memory Mapping for Large Arrays
- Binary Data with Pickle
- Cross-Language Compatibility
- Performance and Format Comparison
- Best Practices
- Explore More
Data Persistence Made Simple
Your carefully crafted NumPy arrays shouldn't vanish when your program ends. Learn to save them efficiently and load them back exactly as they were.
NumPy's Native Formats
import numpy as np

# Create sample data
data = np.random.rand(1000, 100)
labels = np.array(['cat', 'dog', 'bird'] * 100)

# Save a single array (.npy format)
np.save('my_data.npy', data)
loaded_data = np.load('my_data.npy')
print(np.array_equal(data, loaded_data))  # True

# Save multiple arrays (.npz format)
np.savez('dataset.npz', features=data, labels=labels)
loaded = np.load('dataset.npz')
print(loaded['features'].shape)  # (1000, 100)
print(loaded['labels'][:3])      # ['cat' 'dog' 'bird']

# Compressed format (saves space)
np.savez_compressed('compressed_data.npz',
                    large_array=np.random.rand(10000, 1000))

# With a context manager (automatically closes the file);
# named 'archive' so it doesn't shadow the 'data' array above
with np.load('dataset.npz') as archive:
    features = archive['features']
    labels = archive['labels']
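One detail worth knowing: an .npz archive is read lazily, so nothing is pulled into memory until you index it. A quick sketch of inspecting an archive before loading anything:

# List what the archive contains without reading the arrays
archive = np.load('dataset.npz')
print(archive.files)  # ['features', 'labels']

# The actual disk I/O happens only on access
features = archive['features']
archive.close()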
Text-Based Formats for Human Readability
# Save as text (CSV-like)
small_array = np.random.rand(5, 3)
np.savetxt('data.txt', small_array, delimiter=',', fmt='%.4f')

# Load text data
loaded_text = np.loadtxt('data.txt', delimiter=',')

# Custom formatting
np.savetxt('formatted.txt', small_array,
           fmt='%.2e',                 # Scientific notation
           delimiter='\t',             # Tab separated
           header='col1\tcol2\tcol3',  # Header row
           comments='# ')              # Comment prefix

# Handling mixed data types with a structured array
mixed_data = np.array([('Alice', 25, 1.75), ('Bob', 30, 1.80)],
                      dtype=[('name', 'U10'), ('age', 'i4'), ('height', 'f4')])
np.savetxt('mixed.txt', mixed_data, fmt='%s %d %.2f')
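Reading that mixed file back takes one extra step, because plain np.loadtxt assumes floats. A short sketch using np.genfromtxt with the same dtype the file was written with:

# genfromtxt splits on whitespace by default, which matches
# the '%s %d %.2f' format used when writing 'mixed.txt'
restored = np.genfromtxt('mixed.txt',
                         dtype=[('name', 'U10'), ('age', 'i4'), ('height', 'f4')],
                         encoding='utf-8')
print(restored['name'])  # ['Alice' 'Bob']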
Memory Mapping for Large Arrays
# Memory-mapped arrays for huge datasets
huge_array = np.random.rand(100000, 1000)

# Save as a memory-mapped file
mmap_array = np.memmap('huge_data.dat', dtype='float64', mode='w+',
                       shape=(100000, 1000))
mmap_array[:] = huge_array[:]  # Copy data into the mapped file
mmap_array.flush()             # Write pending changes to disk
del mmap_array                 # Release the mapping

# Load as memory-mapped (doesn't read everything into RAM)
loaded_mmap = np.memmap('huge_data.dat', dtype='float64', mode='r',
                        shape=(100000, 1000))
print(loaded_mmap[0, :5])  # Access specific parts without loading the whole file
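Memory mapping also lets you edit a file in place without reading it first: open in 'r+' mode, write to a slice, and flush. A minimal sketch against the 'huge_data.dat' file created above:

# Update one row on disk; only the touched pages are written back
editable = np.memmap('huge_data.dat', dtype='float64', mode='r+',
                     shape=(100000, 1000))
editable[0, :] = 0.0
editable.flush()
del editable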
Binary Data with Pickle
import pickle

# Complex objects with metadata
class DataContainer:
    def __init__(self, data, metadata):
        self.data = data
        self.metadata = metadata

container = DataContainer(np.random.rand(100, 50),
                          {'created': '2024-01-01', 'version': 1.0})

# Save with pickle
with open('container.pkl', 'wb') as f:
    pickle.dump(container, f)

# Load with pickle (only unpickle files you trust: pickle can execute code)
with open('container.pkl', 'rb') as f:
    loaded_container = pickle.load(f)
print(loaded_container.metadata)
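NumPy's own formats fall back to pickle for object arrays, so the same trust caveat applies there. A short sketch, assuming you'd rather store arbitrary Python objects through np.save than through raw pickle:

# Object arrays are pickled inside the .npy container
obj_array = np.array([{'created': '2024-01-01', 'version': 1.0}], dtype=object)
np.save('obj_data.npy', obj_array, allow_pickle=True)

# Loading pickled content must be opted into explicitly
restored_obj = np.load('obj_data.npy', allow_pickle=True)
print(restored_obj[0]['version'])  # 1.0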
Cross-Language Compatibility
# HDF5 format (requires h5py)
# Great for large datasets and cross-language compatibility
try:
    import h5py

    with h5py.File('data.h5', 'w') as f:
        f.create_dataset('array1', data=np.random.rand(1000, 100))
        f.create_dataset('array2', data=np.random.rand(500, 200))
        f.attrs['description'] = 'My dataset'

    with h5py.File('data.h5', 'r') as f:
        loaded_array = f['array1'][:]
        description = f.attrs['description']
except ImportError:
    print("Install h5py for HDF5 support: pip install h5py")
Performance and Format Comparison
| Format | Speed | Size | Cross-platform | Human Readable |
|---|---|---|---|---|
| .npy | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐ | ⭐⭐⭐ | ❌ |
| .npz | ⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ | ⭐⭐⭐ | ❌ |
| .txt | ⭐⭐ | ⭐⭐ | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ |
| .h5 | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ | ❌ |
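The stars are rough guidance; actual speed and size depend on your array's shape and dtype, so it's worth measuring on your own data. A minimal timing sketch (the bench.* file names are arbitrary):

import os
import time

arr = np.random.rand(2000, 1000)

for name, saver in [('bench.npy', np.save),
                    ('bench.npz', lambda p, a: np.savez_compressed(p, a=a)),
                    ('bench.txt', np.savetxt)]:
    start = time.perf_counter()
    saver(name, arr)
    elapsed = time.perf_counter() - start
    size_mb = os.path.getsize(name) / 1e6
    print(f'{name}: {elapsed:.2f}s, {size_mb:.1f} MB')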
Best Practices
- Use .npy for single arrays, .npz for multiple arrays
- Choose compressed formats for storage efficiency (see the round-trip sketch after this list)
- Use memory mapping for arrays larger than RAM
- Consider HDF5 for complex, structured datasets
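Putting a couple of these together, here is a hypothetical save_dataset helper; the name and the round-trip check are illustrative, not a NumPy API:

def save_dataset(path, **arrays):
    """Save named arrays compressed, then verify the round trip."""
    np.savez_compressed(path, **arrays)
    with np.load(path) as archive:
        for name, original in arrays.items():
            assert np.array_equal(archive[name], original), name

save_dataset('checked.npz', features=np.random.rand(100, 10),
             labels=np.arange(100))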
Explore More
Dive into large-scale data processing, master data serialization techniques, and explore scientific data workflows.