️🐍 Polyfactory and Patients
On using Polyfactory to generate representative data.
2024-12-03
PyData Leeds was recently fortunate enough to play host to the "NHS Data Community Takeover". To preface things with a quote from that blog post:
Unfortunately, quality test data is non-existent as some (if not most) NHS organisations, especially data that reflects real-life use cases and the depth of complexity the real world represents. This is particularly challenging given the weight of this dataset, it's real people’s healthcare data and it deserves the respect we give it.
Sadly, yours truly was plague-ridden and missed out and, more importantly, didn't get the chance to bang my drum in favour of one of my favourite libraries for just this task: Polyfactory.
Getting things underway with a new project and adding the initial dependency (yes, I'm using Poetry, don't judge):
poetry init
poetry add polyfactory
I originally came across this project a year or two ago, back when it was known as pydantic-factories. That should give some indication as to how I'm going to start things off: with a Pydantic model.
poetry add pydantic
I'll keep things relatively simple and attempt to represent only a single—and heavily simplified—entity, a patient:
from datetime import date

from pydantic import BaseModel, constr


class Patient(BaseModel):
    patient_id: int
    date_of_birth: date | None
    full_name: constr(max_length=70) | None
    title: constr(max_length=35) | None
    forename: constr(max_length=35) | None
    surname: constr(max_length=35) | None
Armed with a Pydantic model which represents an entity for which one would like to produce accurate data, what's next? Time to turn to Polyfactory:
from polyfactory.factories import pydantic_factory


class PatientFactory(pydantic_factory.ModelFactory):
    __model__ = Patient
That's it. Done, dusted; time to celebrate.
To prove the point:
>>> print(PatientFactory().build().model_dump_json(indent=2))
{
  "patient_id": 9428,
  "date_of_birth": "2022-06-24",
  "full_name": "e5",
  "title": null,
  "forename": "eb0f2d8ca6e484819494",
  "surname": null
}
All data match the provided schema, including a variety of missing data. Okay, it may fit the (fairly loose) schema but perhaps it's not entirely representative.
To demonstrate how that factory can be extended to include data which better represents the expected values, one can turn to Faker:
poetry add faker
The PatientFactory can be extended to make use of some in-built functionality in Faker:
import faker
from polyfactory import Use

gb = faker.Faker("en-GB")


class PatientFactory(pydantic_factory.ModelFactory):
    __model__ = Patient

    title = Use(gb.prefix)
    forename = Use(gb.first_name)
    surname = Use(gb.last_name)
The above makes use of Polyfactory's Use class which takes a callable, in this case three of Faker's name-related functions, to be invoked at build-time.
…which looks like:
>>> print(PatientFactory().build().model_dump_json(indent=2))
{
  "patient_id": 9201,
  "date_of_birth": null,
  "full_name": "aa104",
  "title": "Dr",
  "forename": "Guy",
  "surname": "Mitchell"
}
Great, that's one Dr. Guy Mitchell. However, those values of course can be NULL; it's straightforward enough to wrap those in another function which will optionally return None instead of a value:
import typing

T = typing.TypeVar("T")


def _random_nullable(f: typing.Callable[[], T]) -> T | None:
    # Flip a coin using the factory's own random source.
    if not pydantic_factory.ModelFactory.__random__.getrandbits(1):
        return None
    return f()
class PatientFactory(pydantic_factory.ModelFactory):
    __model__ = Patient

    title = Use(_random_nullable, gb.prefix)
    forename = Use(_random_nullable, gb.first_name)
    surname = Use(_random_nullable, gb.last_name)
However, something like full_name now makes no sense; one could simply repeat the above:
full_name = Use(gb.name)
…but the resulting name would be inconsistent with the other name components. Polyfactory has a couple of options here but I'll make use of the post_generated decorator:
from polyfactory.decorators import post_generated


class PatientFactory(pydantic_factory.ModelFactory):
    __model__ = Patient

    title = Use(_random_nullable, gb.prefix)
    forename = Use(_random_nullable, gb.first_name)
    surname = Use(_random_nullable, gb.last_name)

    @post_generated
    @classmethod
    def full_name(
        cls,
        title: str | None,
        forename: str | None,
        surname: str | None,
    ) -> str | None:
        if not (parts := list(filter(None, [title, forename, surname]))):
            return None
        return " ".join(parts)
Here, I've defined a function with both classmethod and post_generated decorators which takes the name component fields as arguments. If they're all None, the resulting full_name must be None, otherwise the components are combined into the final value, resulting in:
>>> print(PatientFactory().build().model_dump_json(indent=2))
{
  "patient_id": 2685,
  "date_of_birth": "2022-06-09",
  "full_name": "Mrs Guy Kaur",
  "title": "Mrs",
  "forename": "Guy",
  "surname": "Kaur"
}
Pleased to meet you, Mrs. Guy Kaur.
To demonstrate one last feature, I'll look at the date_of_birth field. As can be seen above, it's correctly generating (optionally-null) dates. However, Faker has a number of date-related functions:
>>> print("\n".join(f for f in dir(gb) if "date" in f.lower()))
date
date_between
date_between_dates
date_object
date_of_birth
date_this_century
date_this_decade
date_this_month
date_this_year
date_time
date_time_ad
date_time_between
date_time_between_dates
date_time_this_century
date_time_this_decade
date_time_this_month
date_time_this_year
future_date
future_datetime
passport_dates
past_date
past_datetime
…several of which would be more appropriate (past_date, for instance). What if, in the spirit of things, I want the range of data to match a distribution?
According to the ONS, the average (median, apparently) age in the UK is 40.7 years (there are later data but there's no convenient headline figure).
With that in mind, I can make use of NumPy:
poetry add numpy
…to provide a normal distribution around that value (making some assumptions about standard deviation):
from datetime import date, timedelta

import numpy


def _random_date_of_birth() -> date:
    today = date.today()
    peak = today - timedelta(days=365 * 40)
    # Assumed standard deviation of 15 years; the tail can occasionally land in the future.
    offset_days = int(numpy.random.normal(loc=0, scale=365 * 15))
    return peak + timedelta(days=offset_days)
class PatientFactory(pydantic_factory.ModelFactory):
    __model__ = Patient

    title = Use(_random_nullable, gb.prefix)
    forename = Use(_random_nullable, gb.first_name)
    surname = Use(_random_nullable, gb.last_name)
    date_of_birth = Use(_random_nullable, _random_date_of_birth)
Functionally, that shouldn't affect the way the data appear but the date_of_birth field will now follow a normal distribution, centred around the likely average age of a patient.
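An unbounded normal distribution does have one wrinkle: with a 15-year spread around 40, a few draws in every thousand land in the future, and a few more produce implausibly old patients. Below is a hedged, standard-library-only variation that clamps the age instead. The function name, the 110-year cap and the `rng` parameter are my own assumptions rather than anything from the ONS:

```python
import random
from datetime import date, timedelta

DAYS_PER_YEAR = 365.25
MAX_AGE_YEARS = 110  # assumed upper bound, not taken from the ONS figures


def random_date_of_birth(
    rng: random.Random,
    median_age_years: float = 40.7,
    spread_years: float = 15.0,
) -> date:
    """Draw an age from a normal distribution, clamped to [0, MAX_AGE_YEARS]."""
    age_years = rng.gauss(median_age_years, spread_years)
    age_years = min(max(age_years, 0.0), MAX_AGE_YEARS)
    return date.today() - timedelta(days=round(age_years * DAYS_PER_YEAR))


rng = random.Random(2024)  # fixed seed so the sample is repeatable
sample = [random_date_of_birth(rng) for _ in range(1_000)]
print(min(sample), max(sample))
```

Clamping piles the out-of-range draws up at the two boundaries; redrawing until the value falls in range would avoid that at the cost of a loop. Passing in the factory's own `__random__` as `rng` would also keep the dates tied to the same random source as the nullable wrapper.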
That's it. That's Polyfactory in a nutshell: with a simple Pydantic model (other factories are available), it's possible to generate data which matches that model. Equally, it's possible to incrementally extend it as particular fields, or the relationships between fields, are better understood.
Perhaps the biggest challenge is that it does require that one know one's data in order to properly model it. However, it does require that one know one's data to properly model it. And how could that possibly be a bad thing?