️🐍 Polyfactory and Patients

On using Polyfactory to generate representative data.

2024-12-03

#python #polyfactory

PyData Leeds was recently fortunate enough to play host to the "NHS Data Community Takeover". To preface things with a quote from that blog post:

Unfortunately, quality test data is non-existent as some (if not most) NHS organisations, especially data that reflects real-life use cases and the depth of complexity the real world represents. This is particularly challenging given the weight of this dataset, it's real people’s healthcare data and it deserves the respect we give it.

Sadly, yours truly was plague-ridden and missed out and, more importantly, didn't get the chance to bang my drum in favour of one of my favourite libraries for just this task: Polyfactory.

Getting things underway with a new project and adding the initial dependency (yes, I'm using Poetry, don't judge):

poetry init
poetry add polyfactory

I originally came across this project a year or two ago, back when it was known as pydantic-factories. That should give some indication as to how I'm going to start things off: with a Pydantic model.

poetry add pydantic

I'll keep things relatively simple and attempt to represent only a single—and heavily simplified—entity, a patient:

from datetime import date

from pydantic import BaseModel, constr


class Patient(BaseModel):
    patient_id: int
    date_of_birth: date | None
    full_name: constr(max_length=70) | None
    title: constr(max_length=35) | None
    forename: constr(max_length=35) | None
    surname: constr(max_length=35) | None

Armed with a Pydantic model which represents an entity for which one would like to produce accurate data, what's next? Time to turn to Polyfactory:

from polyfactory.factories import pydantic_factory


class PatientFactory(pydantic_factory.ModelFactory):
    __model__ = Patient

That's it. Done, dusted; time to celebrate.

To prove the point:

>>> print(PatientFactory().build().model_dump_json(indent=2))
{
  "patient_id": 9428,
  "date_of_birth": "2022-06-24",
  "full_name": "e5",
  "title": null,
  "forename": "eb0f2d8ca6e484819494",
  "surname": null
}

All data match the provided schema, including a variety of missing data. Okay, it may fit the (fairly loose) schema but perhaps it's not entirely representative.

To demonstrate how that factory can be extended to include data which better represents the expected values, one can turn to Faker:

poetry add faker

The PatientFactory can be extended to make use of some of Faker's built-in functionality:

import faker

from polyfactory import Use


gb = faker.Faker("en-GB")

class PatientFactory(pydantic_factory.ModelFactory):
    __model__ = Patient

    title = Use(gb.prefix)
    forename = Use(gb.first_name)
    surname = Use(gb.last_name)

The above makes use of Polyfactory's Use class, which takes a callable (here, one of three of Faker's name-related functions) to be invoked at build time.

…which looks like:

>>> print(PatientFactory().build().model_dump_json(indent=2))
{
  "patient_id": 9201,
  "date_of_birth": null,
  "full_name": "aa104",
  "title": "Dr",
  "forename": "Guy",
  "surname": "Mitchell"
}

Great, that's one Dr. Guy Mitchell. However, those values can, of course, be NULL; it's straightforward enough to wrap each callable in another function which will randomly return None instead of a value:

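import typing


def _random_nullable(f: typing.Callable[[], typing.Any]) -> typing.Any:
    # Flip a coin using ModelFactory's random generator: on 0, skip the
    # call and return None; on 1, return whatever f produces.
    if not pydantic_factory.ModelFactory.__random__.getrandbits(1):
        return None

    return f()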


class PatientFactory(pydantic_factory.ModelFactory):
    __model__ = Patient

    title = Use(_random_nullable, gb.prefix)
    forename = Use(_random_nullable, gb.first_name)
    surname = Use(_random_nullable, gb.last_name)
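A 50/50 split between present and missing values is unlikely to match real data. As a rough, library-free sketch of the same idea, the wrapper below takes a hypothetical p_none parameter (not part of the code above) and draws from its own seeded generator rather than from Polyfactory's:

```python
import random
from typing import Callable, Optional, TypeVar

T = TypeVar("T")

# Hypothetical module-level generator, seeded so that runs are repeatable.
_rng = random.Random(2024)


def random_nullable(f: Callable[[], T], p_none: float = 0.5) -> Optional[T]:
    """Return None with probability p_none, otherwise the result of f()."""
    if _rng.random() < p_none:
        return None
    return f()


# Roughly one in ten titles should come back missing.
values = [random_nullable(lambda: "Dr", p_none=0.1) for _ in range(1_000)]
missing = sum(v is None for v in values)
print(f"{missing} of {len(values)} values are missing")
```

Assuming Use forwards keyword arguments to the callable in the same way as positional ones, this could be wired in as Use(random_nullable, gb.prefix, p_none=0.1).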

However, full_name is still a random string, which now makes no sense alongside the other fields; one could simply repeat the above:

full_name = Use(gb.name)

…but the resulting name would be inconsistent with the other name components. Polyfactory has a couple of options here but I'll make use of the post_generated decorator:

from polyfactory.decorators import post_generated


class PatientFactory(pydantic_factory.ModelFactory):
    __model__ = Patient

    title = Use(_random_nullable, gb.prefix)
    forename = Use(_random_nullable, gb.first_name)
    surname = Use(_random_nullable, gb.last_name)

    @post_generated
    @classmethod
    def full_name(
        cls,
        title: str | None,
        forename: str | None,
        surname: str | None,
    ) -> str | None:
        if not (parts := list(filter(None, [title, forename, surname]))):
            return None

        return " ".join(parts)

Here, I've defined a function with both classmethod and post_generated decorators which takes the name component fields as arguments. If they're all None, the resulting full_name must be None, otherwise the components are combined into the final value, resulting in:

>>> print(PatientFactory().build().model_dump_json(indent=2))
{
  "patient_id": 2685,
  "date_of_birth": "2022-06-09",
  "full_name": "Mrs Guy Kaur",
  "title": "Mrs",
  "forename": "Guy",
  "surname": "Kaur"
}

Pleased to meet you, Mrs. Guy Kaur.
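The joining logic is simple enough to check in isolation, away from the factory. A minimal sketch, using a hypothetical standalone join_name function which mirrors the body of full_name above:

```python
from typing import Optional


def join_name(
    title: Optional[str],
    forename: Optional[str],
    surname: Optional[str],
) -> Optional[str]:
    """Join whichever name components are present; None if all are missing."""
    # Like filter(None, ...), this drops empty strings as well as None.
    parts = [p for p in (title, forename, surname) if p]
    return " ".join(parts) if parts else None


print(join_name("Mrs", "Guy", "Kaur"))  # Mrs Guy Kaur
print(join_name(None, "Guy", None))     # Guy
print(join_name(None, None, None))      # None
```

Partially-missing names therefore still produce a sensible full_name, with no stray spaces where a component is absent.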

To demonstrate one last feature, I'll look at the date_of_birth field. As can be seen above, it's correctly generating (optionally-null) dates. However, Faker has a number of date-related functions:

>>> print("\n".join(f for f in dir(gb) if "date" in f.lower()))
date
date_between
date_between_dates
date_object
date_of_birth
date_this_century
date_this_decade
date_this_month
date_this_year
date_time
date_time_ad
date_time_between
date_time_between_dates
date_time_this_century
date_time_this_decade
date_time_this_month
date_time_this_year
future_date
future_datetime
passport_dates
past_date
past_datetime

…several of which would be more appropriate (past_date, for instance). What if, in the spirit of things, I want the range of data to match a distribution?

According to the ONS, the average (median, apparently) age in the UK is 40.7 years (there are later data but there's no convenient headline figure).

With that in mind, I can make use of NumPy:

poetry add numpy

…to provide a normal distribution around that value (making some assumptions about standard deviation):

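from datetime import date, timedelta

import numpy


def _random_date_of_birth() -> date:
    today = date.today()
    # Centre the distribution on a birth date roughly 40 years ago...
    peak = today - timedelta(days=365 * 40)
    # ...and spread it with an assumed standard deviation of 15 years.
    # Note: the tail of the distribution can occasionally land after today.
    offset_days = int(numpy.random.normal(loc=0, scale=365 * 15))

    return peak + timedelta(days=offset_days)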


class PatientFactory(pydantic_factory.ModelFactory):
    __model__ = Patient

    title = Use(_random_nullable, gb.prefix)
    forename = Use(_random_nullable, gb.first_name)
    surname = Use(_random_nullable, gb.last_name)
    date_of_birth = Use(_random_nullable, _random_date_of_birth)

Functionally, that shouldn't affect the way the data appear but the date_of_birth field will now follow a normal distribution, centred around the likely average age of a patient.
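One wrinkle with an unbounded normal distribution is that a small fraction of draws will fall in the future. The sketch below is one possible way around that, not a recommendation: it swaps NumPy for the standard library's random module, redraws any negative age, and takes today as an argument so the result can be checked. The 40.7-year centre is the ONS figure quoted above; the 15-year standard deviation is the same assumption as before, and the function signature is hypothetical.

```python
import random
import statistics
from datetime import date, timedelta

DAYS_PER_YEAR = 365.25

# Seeded so that the check below is repeatable.
_rng = random.Random(40)


def random_date_of_birth(
    today: date, mean_age: float = 40.7, sd_years: float = 15.0
) -> date:
    """Draw a date of birth whose age is roughly normal, never in the future."""
    while True:
        age_days = _rng.gauss(mean_age * DAYS_PER_YEAR, sd_years * DAYS_PER_YEAR)
        # Reject negative ages (future births) and draw again.
        if age_days >= 0:
            return today - timedelta(days=int(age_days))


today = date(2024, 12, 3)
births = [random_date_of_birth(today) for _ in range(10_000)]
ages = [(today - b).days / DAYS_PER_YEAR for b in births]
print(f"median age: {statistics.median(ages):.1f}")
print(f"latest date of birth: {max(births)}")
```

Redrawing trims the tail only slightly, so the median stays close to the target while every generated patient has already been born.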

That's it. That's Polyfactory in a nutshell: with a simple Pydantic model (other factories are available), it's possible to generate data which match that model. Equally, it's possible to extend the factory incrementally as particular fields, or the relationships between them, are better understood.

Perhaps the biggest challenge is that it requires one to know one's data in order to model them properly. Then again, it requires one to know one's data. And how could that possibly be a bad thing?