Profiling and Performance in Python
Before optimizing code, measure. Guessing at bottlenecks wastes time and often makes code worse. Python provides several tools for measuring where time actually goes.
timeit – Micro-Benchmarking
timeit runs a small snippet many times and reports wall-clock time. It disables garbage collection by default so GC pauses don't skew results.
The setup argument runs once before the timed loop; use it to import modules or create data that the snippet needs, without including that time in the measurement:
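A minimal sketch of the setup argument (the `data` name is illustrative):

```python
import timeit

# setup runs once per measurement; only stmt is timed
t = timeit.timeit(
    stmt="sum(data)",
    setup="data = list(range(1000))",
    number=10_000,
)
print(f"{t:.4f} s for 10,000 runs")
```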
timeit.repeat()
repeat(stmt, setup, repeat=5, number=1000000) runs number iterations repeat times and returns a list of per-run totals. Take min() of the list: the minimum is the most stable estimate, because it reflects the run least disturbed by OS scheduling and other background noise:
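A short sketch (the snippet being timed is illustrative):

```python
import timeit

results = timeit.repeat(
    stmt="sorted(data)",
    setup="import random; data = [random.random() for _ in range(100)]",
    repeat=5,
    number=1_000,
)
# report the run least affected by background noise
print(min(results))
```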
timeit from the Command Line
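The same measurements can be run without writing a script. `-s` supplies setup code, and `-n`/`-r` override the loop and repeat counts (these are the documented `python -m timeit` flags); when `-n` is omitted, timeit auto-calibrates the loop count:

```shell
# time a snippet directly; -s runs setup once per repeat
python3 -m timeit -s "data = list(range(1000))" "sum(data)"

# control loop count (-n) and repeat count (-r) explicitly
python3 -m timeit -n 1000 -r 5 -s "data = list(range(1000))" "sum(data)"
```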
cProfile – Full Program Profiling
cProfile is a deterministic profiler: it instruments every function call. Use it to find which functions consume the most time in a real program.
The columns in cProfile's output table:
| Column | Meaning |
|---|---|
| ncalls | Total calls (recursive shown as 2692537/1 = total/primitive) |
| tottime | Time spent inside this function, excluding subcalls |
| percall | tottime ÷ ncalls |
| cumtime | Total time including all subcalls |
| percall (2nd) | cumtime ÷ primitive calls |
Look at tottime to find functions that are slow in themselves; look at cumtime to find the most expensive call trees.
pstats – Sorting Profile Output
pstats.Stats lets you load, sort, and filter profile data programmatically:
Sort keys: "tottime", "cumtime", "ncalls", "filename", "name".
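A self-contained sketch that profiles in-process and then inspects the data with pstats (the `work` function is illustrative):

```python
import cProfile
import io
import pstats

def work():
    return sum(i * i for i in range(50_000))

pr = cProfile.Profile()
pr.enable()
work()
pr.disable()

# load into Stats, drop directory paths, sort by cumulative time,
# and print only the top 5 entries
stream = io.StringIO()
stats = pstats.Stats(pr, stream=stream)
stats.strip_dirs().sort_stats("cumtime").print_stats(5)
print(stream.getvalue())
```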
memory_profiler – Concept (Not in Stdlib)
memory_profiler is a third-party package (pip install memory-profiler) that reports per-line memory usage. It uses the @profile decorator (which is injected by the runner, not imported):
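A sketch of the usage pattern. The try/except shim is an addition so the file also runs as a plain script without the package installed; `build_list` is an illustrative name:

```python
# Run as: python -m memory_profiler this_script.py
# The runner injects `profile`; this shim keeps the script runnable without it.
try:
    profile
except NameError:
    def profile(func):
        return func

@profile
def build_list():
    data = [0] * 1_000_000            # large allocation shows up per-line
    squares = [x * x for x in range(10_000)]
    return len(data) + len(squares)

if __name__ == "__main__":
    build_list()
```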
For a quick size estimate without installing anything, use sys.getsizeof:
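sys.getsizeof reports the shallow size of one object in bytes; container sizes exclude the elements they reference, so summing elements separately gives a rough total:

```python
import sys

print(sys.getsizeof(0))           # a small int object
print(sys.getsizeof([]))          # an empty list
print(sys.getsizeof([1, 2, 3]))  # list header + 3 pointers, not the ints

# rough shallow total for a container: the list plus each element
data = list(range(100))
total = sys.getsizeof(data) + sum(sys.getsizeof(x) for x in data)
print(f"~{total} bytes")
```

Exact numbers vary by Python version and platform, so none are shown here.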
Practical Python Performance Tips
1. Use built-in functions over manual loops
Built-ins like sum(), map(), filter(), min(), max() are implemented in C and run faster than equivalent Python for-loops:
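A minimal comparison; both compute the same result, but sum() does the iteration in C:

```python
def manual_sum(data):
    # pure-Python loop: one bytecode dispatch per element
    total = 0
    for x in data:
        total += x
    return total

data = list(range(1_000_000))

# built-in sum() iterates in C
assert manual_sum(data) == sum(data)
```

Timing the two with timeit (as shown earlier) makes the gap concrete.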
2. Avoid global variable lookups: cache them as locals
Python looks up global names in the module's __dict__ on every access. Inside a tight loop, binding a global to a local variable eliminates that overhead:
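A sketch of the pattern, using math.sqrt as the cached global:

```python
import math

def slow(n):
    total = 0.0
    for i in range(n):
        total += math.sqrt(i)   # global lookup of `math` on every iteration
    return total

def fast(n):
    sqrt = math.sqrt            # bind the global to a local, once
    total = 0.0
    for i in range(n):
        total += sqrt(i)        # cheap local-variable lookup
    return total
```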
3. List comprehensions vs for-loops
List comprehensions are compiled to a faster bytecode path than equivalent for-loops with .append():
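The two equivalent forms side by side:

```python
def with_append(n):
    result = []
    for i in range(n):
        result.append(i * i)    # method lookup + call per element
    return result

def with_comprehension(n):
    # compiled to a dedicated LIST_APPEND bytecode, no attribute lookup
    return [i * i for i in range(n)]
```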
4. Avoid repeated attribute lookup
Each obj.method expression traverses the attribute chain. Cache it:
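A minimal sketch, caching the bound method outside the loop:

```python
def collect_slow(n):
    result = []
    for i in range(n):
        result.append(i)        # attribute lookup on every iteration
    return result

def collect_fast(n):
    result = []
    append = result.append      # resolve the bound method once
    for i in range(n):
        append(i)
    return result
```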
5. `__slots__` for memory-dense objects
By default, Python instances store attributes in a __dict__. Declaring __slots__ replaces that with a fixed-size array, cutting memory by ~40-50% for many small objects:
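A small comparison class pair (names are illustrative); the slotted version has no per-instance `__dict__` at all:

```python
import sys

class PointDict:
    def __init__(self, x, y):
        self.x = x
        self.y = y

class PointSlots:
    __slots__ = ("x", "y")      # fixed attribute set, no __dict__
    def __init__(self, x, y):
        self.x = x
        self.y = y

d = PointDict(1, 2)
s = PointSlots(1, 2)
assert not hasattr(s, "__dict__")
# the dict-based instance pays for the object AND its dict
print(sys.getsizeof(d) + sys.getsizeof(d.__dict__), "vs", sys.getsizeof(s))
```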
6. lru_cache for expensive pure functions
functools.lru_cache memoizes calls so repeated inputs return cached results without re-computing:
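The classic demonstration is naive Fibonacci, which the cache turns from exponential to linear:

```python
from functools import lru_cache

@lru_cache(maxsize=None)    # unbounded cache
def fib(n):
    if n < 2:
        return n
    return fib(n - 1) + fib(n - 2)

print(fib(100))           # instant; uncached recursion would never finish
print(fib.cache_info())   # hit/miss counters for tuning
```

lru_cache only suits pure functions: same arguments, same result, no side effects.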
7. NumPy for numeric work
For array arithmetic, NumPy's vectorized operations run in C and are 10-100× faster than Python loops:
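A small sketch, assuming NumPy is installed (`pip install numpy`):

```python
import numpy as np

data = list(range(1_000_000))
arr = np.arange(1_000_000)

# pure-Python: one interpreted operation per element
py_result = [x * 2 + 1 for x in data]

# vectorized: a single C-level pass over the whole array
np_result = arr * 2 + 1

assert py_result[:5] == list(np_result[:5])
```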
Big-O of Common Operations
| Operation | Complexity | Notes |
|---|---|---|
| list.append() | O(1) amortized | Occasional resize is O(n) |
| list.insert(i, x) | O(n) | Shifts all elements after index |
| list[i] | O(1) | Index access |
| list.pop() | O(1) | Pop from end |
| list.pop(0) | O(n) | Use collections.deque instead |
| x in list | O(n) | Linear scan |
| x in set | O(1) avg | Hash lookup |
| dict[key] | O(1) avg | Hash lookup |
| sorted(list) | O(n log n) | Timsort |
| bisect.bisect | O(log n) | Binary search |
| heapq.heappush | O(log n) | Heap sift |
| heapq.heapify | O(n) | Build heap from list |
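The list.pop(0) row is worth seeing in code; collections.deque supports O(1) pops from both ends:

```python
from collections import deque

queue_list = list(range(5))
queue_deque = deque(range(5))

first_from_list = queue_list.pop(0)        # O(n): shifts every remaining element
first_from_deque = queue_deque.popleft()   # O(1): no shifting

assert first_from_list == first_from_deque == 0
```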
Knowledge Check
In cProfile output, `tottime` represents:
Why does `timeit` disable the garbage collector by default during measurement?
What is the time complexity of `x in some_set` for a Python set?