Data Loading

mGrowthCtrl provides the class Dataloader for loading experimental data from CSV files and from the mGrowthDB database. See below for the reference documentation on a dataloader object’s properties and methods.

The loaded data can be used directly with CRModel (see Consumer-Resource Model).

Loading from CSV

To load data from a local source, use the load_local_data method of the Dataloader class. You can pass a pandas dataframe that you’ve built up however you like, or the path of a CSV file, which the method will try to parse with pandas’ default settings.

The method will try to interpret a CSV’s columns, but ideally you should provide “selector” arguments:

  • Time (time_col): You can provide a single time column as a string or a list of column name candidates. By default, the function will attempt names from the following list: “Time (hours)”, “Time”, “time”, “t”. The found column will be stored as the time_col attribute.

  • X columns (consumers, x_selector): A lambda, a regex, or a list of specific names that determines which column is a consumer, for example lambda c: c.startswith("NCBI:"). If this is omitted, consumer columns will be considered to be the ones that don’t match the s_selector.

  • S columns (substrates, s_selector): A lambda, a regex, or a list of specific names that marks substrates. If this is omitted, substrate columns will be considered to be the ones that remain after consumers are found with the x_selector.

  • Error columns (err_selector): A lambda or a regex that detects error columns. At this time, they’ll only be used when plotting to show the data’s uncertainty, and won’t be used in modeling. If given as a regex, it needs to have a single capture group that returns the target column it is the error of. For instance, if we have the columns “Bacteroides” and “Bacteroides STD”, we could use the selector r'(.*) STD$'. If the selector is a function, it needs to return the target column or None.

At least one of x_selector or s_selector needs to be provided, or the method will raise an exception. The err_selector is optional and only useful if your dataset has a standard deviation column for each value column and those columns follow a consistent naming scheme.
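
The selector rules above can be illustrated with a small standalone sketch. It is not mGrowthCtrl’s implementation: the helper names matches and split_columns are hypothetical, the sample column names are made up, error columns are ignored, and it assumes that a string selector is a regex applied with re.search.

```python
import re


def matches(selector, column):
    """Hypothetical helper: test one column name against a selector."""
    if callable(selector):
        return bool(selector(column))
    if isinstance(selector, str):
        # Assumption: a string is a regex searched anywhere in the name
        return re.search(selector, column) is not None
    # Otherwise treat the selector as a list of exact column names
    return column in set(selector)


def split_columns(columns, time_col, x_selector=None, s_selector=None):
    """Hypothetical helper: split non-time columns into (X, S) name lists."""
    if x_selector is None and s_selector is None:
        raise ValueError("Provide at least one of x_selector or s_selector")

    candidates = [c for c in columns if c != time_col]

    # Consumers: matched directly, or everything the s_selector rejects
    if x_selector is not None:
        x_names = [c for c in candidates if matches(x_selector, c)]
    else:
        x_names = [c for c in candidates if not matches(s_selector, c)]

    # Substrates: matched directly, or everything left after the consumers
    if s_selector is not None:
        s_names = [
            c for c in candidates
            if matches(s_selector, c) and c not in x_names
        ]
    else:
        s_names = [c for c in candidates if c not in x_names]

    return x_names, s_names


# Made-up column names, for illustration only
columns = ["Time (hours)", "NCBI:818", "glucose (mM)", "acetate (mM)"]
x_names, s_names = split_columns(
    columns, "Time (hours)", x_selector=lambda c: c.startswith("NCBI:")
)
print(x_names)
#=> ['NCBI:818']
print(s_names)
#=> ['glucose (mM)', 'acetate (mM)']
```

Passing only an s_selector such as r"\(mM\)" gives the same split in this sketch, because the consumer columns are then taken to be the ones the substrate selector rejects.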

The dataloader will store the cleaned data frame (df), the time column as a numpy array (t), the Y values (Y), and a ModelNames object with the detected column names (names).

Example snippet:

from mgrowthctrl.utils.data import Dataloader

# Load experimental data, detecting names of consumer and resource columns:
data = Dataloader()
data.load_local_data("examples/datasets/bhbtri_3species.csv", x_selector=r"Cells")

print(data.names)
print(data.df)

If you have a dataset that includes error values, the loader expects that these columns are labeled in a consistent way. For instance, the example file BT_WC_export_average.csv uses the string “STD” to indicate standard deviation columns, located between the name of the measured target and the column units. The value column “BT counts” has an error column “BT counts STD”, while the value column “acetate (mM)” has an error column “acetate STD (mM)”. We can use the following code to correctly detect value columns and error columns:

from mgrowthctrl.utils.data import Dataloader

data = Dataloader()
data.load_local_data(
    "examples/datasets/BT_WC_export_average.csv",

    # Use list of strings to name the single X column directly
    x_selector=['BT counts'],

    # Use a lambda to find the " STD" string anywhere in the column and return
    # the STD-less string as the target value column:
    err_selector=lambda col: col.replace(' STD', '') if ' STD' in col else None,
)

print(data.X_names)
#=> ['BT counts']
print(data.X_err_names)
#=> ['BT counts STD']

print(data.S_names)
#=> ['acetate (mM)', 'formate (mM)', ...]
print(data.S_err_names)
#=> ['acetate STD (mM)', 'formate STD (mM)', ...]
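
To see why this example uses a callable rather than a regex, the following standalone sketch compares both styles of error selector on the column names quoted above. It only uses Python’s re module and does not call mGrowthCtrl. The helpers regex_target and lambda_target are hypothetical names, and applying the pattern with re.search is an assumption.

```python
import re


def regex_target(pattern, column):
    """Hypothetical helper: return the first capture group, or None."""
    found = re.search(pattern, column)
    return found.group(1) if found else None


def lambda_target(col):
    # Same logic as the err_selector lambda in the snippet above
    return col.replace(" STD", "") if " STD" in col else None


for column in ["BT counts", "BT counts STD", "acetate (mM)", "acetate STD (mM)"]:
    print(column, "|", regex_target(r"(.*) STD$", column), "|", lambda_target(column))
#=> BT counts | None | None
#=> BT counts STD | BT counts | BT counts
#=> acetate (mM) | None | None
#=> acetate STD (mM) | None | acetate (mM)
```

The suffix regex only resolves the error column of “BT counts”, because “STD” sits before the units in “acetate STD (mM)”. The callable removes “ STD” wherever it appears, so it handles both naming patterns.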

Loading from mGrowthDB

Instead of loading data from a local file and manually describing which columns are consumers and which are resources, you can fetch data from an mGrowthDB experiment by using the load_remote_experiment method. Because the remote source provides metadata for the experiments, you don’t need to explicitly provide column name selectors.

Example snippet that fetches the data of experiment EMGDB000000020:

from mgrowthctrl.utils.data import Dataloader

# Load experimental data, detecting names of consumer and resource columns:
data = Dataloader()
data.load_remote_experiment("EMGDB000000020")

print(data.names)
print(data.df)

By default, the loader will attempt to download the computed “average” bioreplicate of that experiment. If there is no average bioreplicate, the first one will be downloaded. If you’d like to pick a specific bioreplicate, you can select it by the name you can find on the web application:

from mgrowthctrl.utils.data import Dataloader

# Load a specific bioreplicate of the experiment, selected by its name:
data = Dataloader()
data.load_remote_experiment("EMGDB000000020", bioreplicate_name="BT_WC_2")

print(data._remote_bioreplicate_metadata["name"])
#=> BT_WC_2

For debugging purposes, you can access the JSON response from the mGrowthDB API in the _remote_bioreplicate_metadata attribute. To learn more about the API or to use it directly, consult the mGrowthDB API documentation.
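
The selection order described above (an explicit name first, then the “average” bioreplicate, then the first one) can be sketched as plain Python. This is an illustration only, not the library’s code: the helper pick_bioreplicate, the "is_average" key, the sample records other than the name BT_WC_2, and the choice to raise ValueError for an unknown name are all assumptions. Only the "name" key is taken from the snippet above.

```python
def pick_bioreplicate(bioreplicates, bioreplicate_name=None, prefer_average=True):
    """Hypothetical helper: choose one bioreplicate record from a list of dicts."""
    if not bioreplicates:
        raise ValueError("The experiment has no bioreplicates")

    # 1. An explicit name takes priority over everything else
    if bioreplicate_name is not None:
        for record in bioreplicates:
            if record["name"] == bioreplicate_name:
                return record
        raise ValueError(f"No bioreplicate named {bioreplicate_name!r}")

    # 2. Otherwise look for a computed average, if preferred
    if prefer_average:
        for record in bioreplicates:
            # Assumption: averages are flagged with an "is_average" key
            if record.get("is_average"):
                return record

    # 3. Fall back to the first bioreplicate
    return bioreplicates[0]


# Made-up records, for illustration only
records = [
    {"name": "BT_WC_1", "is_average": False},
    {"name": "BT_WC_2", "is_average": False},
    {"name": "Average(BT_WC)", "is_average": True},
]

print(pick_bioreplicate(records)["name"])
#=> Average(BT_WC)
print(pick_bioreplicate(records, bioreplicate_name="BT_WC_2")["name"])
#=> BT_WC_2
print(pick_bioreplicate(records, prefer_average=False)["name"])
#=> BT_WC_1
```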

API Documentation

mgrowthctrl.utils.data.DEFAULT_ROOT_URL = 'https://mgrowthdb.gbiomed.kuleuven.be'

By default, remote data is fetched from the mGrowthDB database.

class mgrowthctrl.utils.data.Dataloader(root_url='https://mgrowthdb.gbiomed.kuleuven.be', api_key=None)

Class that encapsulates data and metadata that is ready for fitting to a CR model. After instantiating, you need to invoke load_local_data or load_remote_experiment to load some data source into the object’s properties.

Properties:

df:

The parsed dataframe

time_col:

The name of the time column in the dataframe

names:

A ModelNames object that includes X and S lists of consumer and substrate column names, respectively

t:

A numpy array of the time values

Y:

A numpy array of the measurements

property X_names

A shorthand to fetch the consumer (X) column names of the dataframe

property S_names

A shorthand to fetch the substrate (S) column names of the dataframe

property X_err_names

A shorthand to fetch the consumer (X) error column names of the dataframe

property S_err_names

A shorthand to fetch the substrate (S) error column names of the dataframe

property y0

A shorthand to fetch the initial measurement values, e.g. for prediction purposes

property start_time: float

The first time value

property end_time: float

The final time value

load_local_data(
data: DataFrame | str | Path,
*,
time_col: str | Iterable[str] = ('Time (hours)', 'Time', 'time', 't'),
x_selector: str | Iterable[str] | Callable[[str], bool] | None = None,
s_selector: str | Iterable[str] | Callable[[str], bool] | None = None,
err_selector: str | Callable[[str], str | None] | None = None,
require_monotonic_time: bool = True,
clip_nonneg: float | None = 0.0,
)

Process a pandas dataframe or a CSV file that will be parsed into a dataframe. The given X and S selectors will be used to determine which columns represent consumers (X) and substrates (S). At least one of the two selectors needs to be provided.

Parameters:
  • data – Either a pandas DataFrame, or a file path to open

  • time_col – The name of the time column or a list of names to try

  • x_selector – A regex, a list of values, or a callable that determines which column represents a consumer.

  • s_selector – A regex, a list of values, or a callable that determines which column represents a substrate.

  • err_selector – A regex or a callable that determines which column represents measurement error. If it’s a regex, it needs to have a single capture group that returns the column it refers to, for example: r'(.*) STD$'. If it’s a callable, it needs to return the column it refers to or None.

  • require_monotonic_time – Whether to check if the values in the time column are monotonic

  • clip_nonneg – A floor value that the values are clipped to, defaults to 0.0, meaning that negative values are truncated to 0.
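
As a rough illustration of the last two parameters, here is a standalone sketch of a floor clip and a monotonic-time check. It is not taken from mGrowthCtrl: the helper names are hypothetical, it assumes that clip_nonneg=None disables clipping, and it assumes that “monotonic” means non-decreasing (the library may require strictly increasing time values).

```python
def clip_floor(values, clip_nonneg=0.0):
    """Hypothetical helper: raise every value below the floor up to the floor."""
    if clip_nonneg is None:
        # Assumption: None means "do not clip"
        return list(values)
    return [max(v, clip_nonneg) for v in values]


def is_monotonic(times):
    """Hypothetical helper: True if the time values never decrease."""
    return all(a <= b for a, b in zip(times, times[1:]))


print(clip_floor([-0.2, 0.0, 1.5]))
#=> [0.0, 0.0, 1.5]
print(clip_floor([-0.2, 0.0, 1.5], clip_nonneg=None))
#=> [-0.2, 0.0, 1.5]
print(is_monotonic([0.0, 0.5, 1.0]))
#=> True
print(is_monotonic([0.0, 1.0, 0.5]))
#=> False
```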

load_remote_experiment(
experiment_id: str,
*,
prefer_average: bool = True,
bioreplicate_name: str | None = None,
time_round: int | None = 3,
include_other: bool = False,
keep_predicate: Callable[[dict], bool] | None = None,
clip_nonneg: float | None = 0.0,
require_monotonic_time: bool = True,
)

Download an experiment’s data from mGrowthDB and detect its consumer and substrate columns. By default, looks for a bioreplicate in the experiment marked as “average”, otherwise fetches data from the first bioreplicate. If given an explicit bioreplicate name that can be found on mGrowthDB, will select that replicate instead.

Parameters:
  • experiment_id – Stable experiment identifier of the form EMGDBxxx

  • prefer_average – Whether to look for the average bioreplicate or not

  • bioreplicate_name – If given, will select a specific bioreplicate by its name

  • time_round – The number of digits to round time values to

  • include_other – Whether to include measurements from unknown techniques

  • keep_predicate – A callable that is invoked with the metadata of each measurement context and decides whether to keep it in the dataframe or not

  • clip_nonneg – A floor value that the values are clipped to, defaults to 0.0, meaning that negative values are truncated to 0.

  • require_monotonic_time – Whether to check if the values in the time column are monotonic