Harmonious microservices with Python, Serverless, and namespace packages

Our project began with a small ask: build a Python back-end using AWS Lambda and API Gateway (just a handful of endpoints to support a few pages) and use Serverless to manage the CloudFormation setup. Not very involved, not much code, not even much configuration for the deploy.

Next came authentication. In pursuing the joint dreams of DRTW ("don't reinvent the wheel") and WALCAP ("write as little code as possible"), we selected AWS Cognito over other options such as Auth0 or rolling our own light authentication. Cognito has excellent integration with API Gateway and is surprisingly easy to set up and maintain, especially for an AWS product.

Let's pretend our project was called "Purplecat". The repository more or less looked something like...

purplecat/
├── alembic/
│  └── ...
├── purplecat/
│  ├── api/
│  │  ├── handlers.py
│  │  └── models.py
│  ├── lib/
│  │  ├── aws/
│  │  │  ├── gateway.py
│  │  │  └── s3.py
│  │  └── db.py
│  └── config.py
├── tests/
│  ├── api/
│  │  ├── test_handlers.py
│  │  └── test_models.py
│  ├── lib/
│  │  ├── test_aws.py
│  │  └── test_db.py
│  ├── factories.py
│  └── fixtures.py
├── package.json
├── requirements-dev.txt
├── requirements.txt
├── serverless.yml
└── setup.py
Thank you, exa, for the wonderful tree view

Not too bad; pretty standard Python. The setup.py excluded the tests/ directory, of course, and the serverless.yml file defined all of the Lambda functions, API Gateway endpoints, and Cognito details.

Growing pains

Over the next few months, the API grew. Five endpoints jumped to dozens, SQLAlchemy models abounded, database queries got comically more complex as we added data versioning and user group-based access to different data, and the test suite was starting to get monstrous.

S3 buckets popped up for a variety of purposes. We quickly broke the 200-resource limit on CloudFormation and had to split the service into nested stacks.

Then we realized that we wanted to manage Gateway and Lambda independently from Cognito and S3, mostly so we wouldn't lose all user accounts if we tore down and re-deployed our API stack; a single serverless.yml file split into three. Still not the end of the world (or of maintainability), though.

Next came more endpoints and some non-API ETL pipelines, along with another Serverless file to manage Step Functions state machines and more Lambdas. Everything still sat in the same repo and the same Python source directory. We started cooking spaghetti code: ETL imported from the API, and the API started importing from ETL. Different services began to require their own schemas and tables in the database, but they relied on the same alembic configuration and revision history.

Even worse, our serverless.yml files were handpicking directories of Python source to include in packaging (because why should a Lambda have all the code in the entire app?), so each included a block like...

# Only include necessary files
# See "Optimizing packaging time": https://github.com/UnitedIncome/serverless-python-requirements
package:
  individually: false
  include:
    - "!./**"
    - "purplecat/__init__.py"
    - "purplecat/config.py"
    - "purplecat/lib/**"
    - "purplecat/api/**"
  exclude:
    - "**"
So much repetitive config

Everything in this block was identical between Serverless configs except for the last include directory, which was the service-specific code. Forgetting to add a common file or directory would not cause a failure until the Lambda actually executed. And given that CloudFormation stacks update with the rapidity of a dying, limbless tortoise, those were especially fun to fix. So... yay?

It was time to pay down some debt before it got out of hand, especially given the looming requirements on the roadmap.

Namespace packages + Services = Microlibs

We explored a few ways of reorganizing both the Python code and the Serverless stacks. We considered just cleaning up the existing code a bit. We thought about multiple repositories. Ultimately, we found this inspiring article by Jorge Herrera on setting up "microlibs" via Python's namespace packages.

(Note: If you're like me and find Python docs to be comically opaque, the high-level view of namespace packages is that they provide a convenient way to have small, interrelated, separately installable packages living within the same repository. Each package will also only contain the code that it explicitly requires as a dependency. A wonderful system, despite looking something like Java's directory hell.)
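As a concrete (if contrived) sketch of what that buys you: with the pkgutil-style flavor of namespace packages, each copy of the shared directory needs only a one-line __init__.py, and separately installed source trees merge under a single import root. All names and file contents below are invented for illustration:

```python
import os
import sys
import tempfile

# Two independent source trees that both ship a "purplecat" directory.
# Each copy of purplecat/__init__.py holds only the pkgutil namespace stub,
# so Python merges the trees into one importable namespace.
root = tempfile.mkdtemp()
NS_STUB = "__path__ = __import__('pkgutil').extend_path(__path__, __name__)\n"

for tree, module, body in [
    ("common", "db", "ENGINE = 'postgres'\n"),
    ("service_api", "handlers", "def ping():\n    return 'pong'\n"),
]:
    pkg = os.path.join(root, tree, "purplecat")
    os.makedirs(pkg)
    with open(os.path.join(pkg, "__init__.py"), "w") as f:
        f.write(NS_STUB)
    with open(os.path.join(pkg, f"{module}.py"), "w") as f:
        f.write(body)
    sys.path.insert(0, os.path.join(root, tree))

# Both modules import from "purplecat" even though they live in separate trees
from purplecat import db, handlers

print(db.ENGINE, handlers.ping())  # postgres pong
```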

We had an identical list of requirements, so we decided to give it a go, adapting it to include our Serverless configurations. Any code imported by more than one service was designated 'common', and no service was allowed to import from another. Moreover, the Python source (if present) that supported a given CloudFormation stack lived alongside the serverless.yml file that would deploy it.

We wound up with a structure something like the following (just much larger).

purplecat/
├── common/
│  ├── alembic/
│  │  └── ...
│  ├── purplecat/
│  │  └── common/
│  │     ├── aws/
│  │     │  ├── gateway.py
│  │     │  └── s3.py
│  │     ├── db.py
│  │     └── models.py
│  ├── tests/
│  │  └── ...
│  ├── requirements-dev.txt
│  ├── requirements.txt
│  └── setup.py
└── service/
   ├── api/
   │  ├── alembic/
   │  │  └── ...
   │  ├── purplecat/
   │  │  └── service/
   │  │     └── api/
   │  │        ├── handlers.py
   │  │        └── models.py
   │  ├── tests/
   │  │  └── ...
   │  ├── package.json
   │  ├── requirements-dev.txt
   │  ├── requirements.txt
   │  ├── serverless.yml
   │  └── setup.py
   ├── auth/
   │  ├── package.json
   │  └── serverless.yml
   └── etl/
      ├── alembic/
      │  └── ...
      ├── purplecat/
      │  └── service/
      │     └── etl/
      │        ├── lambdas.py
      │        ├── models.py
      │        └── step_functions.py
      ├── tests/
      │  └── ...
      ├── package.json
      ├── requirements.txt
      ├── serverless.yml
      └── setup.py
The greatest annoyance with Python's namespace packages is the repetitive directory structures

Each service had its own isolated Serverless config, as well as a package.json that let it define its own scripts. Each service also had its own setup.py that declared it part of the greater namespace package...

from setuptools import find_packages, setup

PACKAGE = "purplecat.service.api"

# Skip blank lines and comments when reading dev requirements
with open("requirements-dev.txt") as reqs:
    DEV = [line.strip() for line in reqs if line.strip() and not line.startswith("#")]

setup(
    name=PACKAGE,
    version="1.0.0",
    packages=find_packages(),
    # Every level of the namespace must be declared, not just the root
    namespace_packages=["purplecat", "purplecat.service"],
    install_requires=[
        "purplecat.common",
        "requests",
        # ...
    ],
    extras_require={"dev": DEV},
)
service/api/setup.py

And each service had a requirements.txt file that specified the location of that common package...

--index-url https://pypi.python.org/simple
../../common
-e .
service/api/requirements.txt

The directory structure was more complex, but...

  • ... the services were far easier to isolate, package, and deploy
  • ... tests could be confined to their respective service and parallelized
  • ... the project structure ultimately had less cognitive overhead
  • ... separating concerns (even on database schemas & migrations) enabled multiple teams to work on the same project more seamlessly
  • ... no service exceeded CloudFormation resource limits, so the complication of nested/split stacks was no longer necessary
  • ... the serverless-python-requirements plugin could do its default packaging without us specifying what directories to include or exclude, and Lambdas still only got what they needed
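To make that last point concrete: after the split, each service's Serverless config could drop the fragile include/exclude block entirely. A hypothetical sketch of the packaging-relevant fragment of service/api/serverless.yml (service name and fragment invented for illustration):

```yaml
# service/api/serverless.yml (fragment)
service: purplecat-api

plugins:
  - serverless-python-requirements

custom:
  pythonRequirements:
    # requirements.txt pulls in ../../common plus the editable service package,
    # so the bundled dependencies are exactly what this service imports
    fileName: requirements.txt
```

With only the service's own code living in its directory, the plugin's default packaging already produces a minimal Lambda artifact.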

Perhaps best of all, this layout provided us with an extensible scaffold and a single, logical way to introduce additional services, which we soon had to do.

Thoughts, questions, corrections, or criticisms? I'd love to hear them!