Suppose you have a function that returns some URLs in the following way (this example is somewhat contrived, but we’ll use it for demonstration purposes)

def get_page_urls(base_url, no_pages):
    result = []
    for i in range(1, no_pages + 1):
        result.append(f'{base_url}/{str(i)}')  # build and store every URL
    return result

or its list comprehension equivalent

def get_page_urls(base_url, no_pages):
    result = [f'{base_url}/{str(i)}' for i in range(1, no_pages + 1)]
    return result

If no_pages is very large, calling get_page_urls will be costly in terms of memory usage (in fact, if it is sufficiently large, you will max out your RAM).

This is because all the URLs in result are stored in memory.

In Python, generators provide a solution to this problem.

Instead of storing an entire sequence in memory, a generator lets you create a kind of virtual, on-demand sequence: each value is produced only when the caller asks for it, so at most one value needs to be in memory at any time.

Generator here means generator object (also referred to sometimes as generator iterator).

A generator object is created either via a generator function or a generator expression.

A generator function is just a normal function with the keyword yield in the body.

def get_page_urls(base_url, no_pages):
    for i in range(1, no_pages + 1):
        yield f'{base_url}/{str(i)}'

It can be used as follows

base_url = ''
no_pages = 10
g = get_page_urls(base_url, no_pages)  # does not run any code in the body of the function
type(g)  # `g` is of type `generator`
next(g)  # returns ''
next(g)  # returns ''

The equivalent code using a generator expression is

base_url = ''
no_pages = 10
g = (f'{base_url}/{str(i)}' for i in range(1, no_pages + 1))
type(g)  # `g` is of type `generator`
next(g)  # returns ''
next(g)  # returns ''
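
As a rough check of the memory claim, the sketch below compares the size of a list of URLs with the size of the equivalent generator object. The URL `https://example.com/page` and the value of `no_pages` are placeholders chosen for illustration, and `sys.getsizeof` only measures the container itself (not the strings a list refers to), so the real gap is larger than the numbers suggest.

```python
import sys

base_url = 'https://example.com/page'  # hypothetical placeholder URL
no_pages = 100_000

# Eager: the list holds a reference to every URL string at once.
urls_list = [f'{base_url}/{i}' for i in range(1, no_pages + 1)]

# Lazy: the generator object only stores its paused state.
urls_gen = (f'{base_url}/{i}' for i in range(1, no_pages + 1))

list_size = sys.getsizeof(urls_list)  # size of the list object itself, grows with no_pages
gen_size = sys.getsizeof(urls_gen)    # small, and does not depend on no_pages

print(f'list: {list_size} bytes, generator: {gen_size} bytes')
```

The exact byte counts vary between Python versions, but the generator stays the same size however large `no_pages` becomes.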

Using a generator allows you to “pay as you go”, i.e. pay for the memory for the values you need as you go along, rather than paying for the total upfront.

In programming parlance, this is known as lazy evaluation (as opposed to eager evaluation).
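
One way to see the difference between eager and lazy evaluation is to record when the work actually happens. In this small sketch, `make_url` and the `/page/...` path are hypothetical names introduced only for illustration.

```python
calls = []  # records every time a URL is actually built


def make_url(i):
    calls.append(i)
    return f'/page/{i}'  # hypothetical path, for illustration only


# Eager: all three URLs are built as soon as the list is created.
eager = [make_url(i) for i in range(1, 4)]
calls_after_eager = list(calls)

calls.clear()

# Lazy: creating the generator builds nothing yet.
lazy = (make_url(i) for i in range(1, 4))
calls_after_creating_lazy = list(calls)

# Each call to next() builds exactly one URL.
first = next(lazy)
calls_after_first_next = list(calls)

print(calls_after_eager)          # [1, 2, 3]
print(calls_after_creating_lazy)  # []
print(calls_after_first_next)     # [1]
```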

Now, let’s suppose get_page_urls does a bit more work. Rather than just returning the URLs, it now returns the response status codes from making an HTTP GET request to each of those URLs, e.g.

import time
import requests

start = time.time()

def get_status_codes(base_url, no_pages):
    result = []
    for i in range(1, no_pages + 1):
        url = f'{base_url}/{str(i)}'
        r = requests.get(url)  # blocking network call
        status_code = r.status_code
        print(f'url: {url}, status code: {status_code}')
        result.append(status_code)  # collect the status code so the returned list is not empty
    return result

base_url = ''
no_pages = 10
status_codes = get_status_codes(base_url, no_pages)
print(f'{time.time() - start:.2f}s')

url:, status code: 200
url:, status code: 200
url:, status code: 200
url:, status code: 200
url:, status code: 200
url:, status code: 200
url:, status code: 200
url:, status code: 200
url:, status code: 200
url:, status code: 200

Each iteration of the for loop makes an HTTP GET request (network I/O), which introduces latency while data is transferred across the network.

The problem now, even when no_pages is small, is not memory but running time.

In the above example, 10 requests with no_pages = 10 had a running time of over 8 seconds. For no_pages = 1000000 (one million requests) this would be a running time of over 9 days…
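
The extrapolation above can be reproduced with a few lines of arithmetic. This sketch assumes exactly 8 seconds for the 10 requests (a lower bound, since the run took "over 8 seconds") and that the time per request stays constant.

```python
measured_seconds = 8.0   # assumed lower bound for the 10-request run
measured_requests = 10
seconds_per_request = measured_seconds / measured_requests

target_requests = 1_000_000
total_seconds = seconds_per_request * target_requests
total_days = total_seconds / (60 * 60 * 24)  # seconds in a day

print(f'{total_days:.2f} days')  # 9.26 days
```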

Clearly, this is not scalable.

Generators to the rescue again?

Unfortunately, generators as we have seen them so far cannot solve this problem. (Generators in Python are syntactically very similar to coroutines, which the standard library module asyncio uses extensively to enable asynchronous programming, and that approach can solve the issue.)
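
As an aside, here is a minimal sketch of how the asyncio approach could look. It does not make real HTTP requests: `fake_get_status` is a hypothetical stand-in that simulates network latency with `asyncio.sleep`, and the URL is a placeholder. A real version would need an asynchronous HTTP client, because `requests.get` blocks while it waits.

```python
import asyncio
import time


async def fake_get_status(url, latency=0.1):
    # Hypothetical stand-in for a non-blocking HTTP GET request.
    await asyncio.sleep(latency)  # simulated network latency
    return 200


async def get_status_codes(base_url, no_pages):
    coros = [fake_get_status(f'{base_url}/{i}') for i in range(1, no_pages + 1)]
    # gather() runs the coroutines concurrently and keeps the results in input order.
    return await asyncio.gather(*coros)


start = time.perf_counter()
status_codes = asyncio.run(get_status_codes('https://example.com/page', 10))
elapsed = time.perf_counter() - start

print(status_codes)
print(f'{elapsed:.2f}s')
```

Because the simulated requests wait concurrently, the total time should be close to the latency of a single request rather than ten times that.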

However, a generator does split up the running time, making it easier to write code in between each request.

It also means if we have a stopping condition, we only incur running time until the condition is met.

import time
import requests

def get_status_codes(base_url, no_pages):
    for i in range(1, no_pages + 1):
        url = f'{base_url}/{str(i)}'
        r = requests.get(url)
        status_code = r.status_code
        print(f'url: {url}, status code: {status_code}')
        yield status_code

base_url = ''
no_pages = 10
g = get_status_codes(base_url, no_pages)  # does not run any code in the body of the function
start = time.time()
next(g) >= 400  # `next(g)` returns 200, continue
print(f'{time.time() - start:.2f}s')  # 0.89s
# do other stuff...
# finished doing other stuff, let's get the next status code
start = time.time()
next(g) >= 400  # `next(g)` returns 503, we are done
print(f'{time.time() - start:.2f}s')  # 0.88s
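
The stopping condition can also be written as a loop that breaks on the first error status. To keep the sketch runnable without a network, this version of `get_status_codes` is a hypothetical stand-in that yields made-up status codes instead of making requests.

```python
def get_status_codes(fake_statuses):
    # Hypothetical stand-in: yields pre-set status codes instead of making requests.
    for i, status_code in enumerate(fake_statuses, start=1):
        print(f'request {i}: status code {status_code}')
        yield status_code


fake_statuses = [200, 200, 503, 200, 200]  # made-up data for illustration

seen = []
for status_code in get_status_codes(fake_statuses):
    seen.append(status_code)
    if status_code >= 400:
        break  # the remaining "requests" are never made

print(seen)  # [200, 200, 503]
```

Only three of the five simulated requests run, which is the saving that the generator version makes possible.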