Data Loading

mGrowthCtrl provides the class Dataloader for loading experimental data from CSV files, or from the mGrowthDB database. See below for the reference documentation on a dataloader object’s properties and methods.

The loaded data can be used directly with CRModel (see Consumer-Resource Model). You can also use a remote connection with mGrowthDB to upload and visualize fitted data.

A video demonstration of using the dataloader can be found on the mGrowthDB YouTube channel: “Workspaces: API integration”.

Loading from CSV

To load data from a local file, you can use the load_local_data method of the Dataloader class. You can provide a pandas dataframe you’ve built up however you like, or the path of a CSV file that the method will try to parse with pandas’ default settings.

The method will try to interpret a CSV’s columns, but ideally you should provide “selector” arguments:

Time (time_col): You can provide a single time column as a string or a list of column name candidates. By default, the function will attempt names from the following list: “Time (hours)”, “Time”, “time”, “t”. The found column will be stored as the time_col attribute.
X columns (consumers, x_selector): A lambda, a regex, or a list of specific names that determines which column is a consumer, for example lambda c: c.startswith("NCBI:"). If this is omitted, consumer columns will be considered to be the ones that don’t match the s_selector.
S columns (substrates, s_selector): A lambda, a regex, or a list of specific names that marks substrates. If this is omitted, substrate columns will be considered to be the ones that remain after consumers are found with the x_selector.
Error columns (err_selector): A lambda or a regex that detects error columns. At this time, they’ll only be used when plotting to show the data’s uncertainty, and won’t be used in modeling. If given as a regex, it needs to have a single capture group that returns the target column it is the error of. For instance, if we have the columns “Bacteroides” and “Bacteroides STD”, we could use the selector r'(.*) STD$'. If the selector is a function, it needs to return the target column or None.

At least one of x_selector or s_selector needs to be provided, or the method will raise an exception. The err_selector is optional and only useful if you have a dataset with standard deviation columns for each value column that follow a standardized naming scheme.
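
The selector rules above can be illustrated with a small standalone sketch. It does not reproduce the library's internals: the helper names as_predicate and split_columns, the use of re.search for regex selectors, and the sample column names are all assumptions made for illustration.

```python
import re


def as_predicate(selector):
    """Turn a callable, a regex string, or a list of names into a column test."""
    if callable(selector):
        return selector
    if isinstance(selector, str):
        pattern = re.compile(selector)
        # Assumption: a regex selector matches anywhere in the column name.
        return lambda col: pattern.search(col) is not None
    names = set(selector)
    return lambda col: col in names


def split_columns(columns, time_col, x_selector=None, s_selector=None):
    """Split non-time columns into consumer (X) and substrate (S) lists."""
    if x_selector is None and s_selector is None:
        raise ValueError("Provide at least one of x_selector or s_selector")
    candidates = [c for c in columns if c != time_col]
    if x_selector is not None:
        is_x = as_predicate(x_selector)
        x_cols = [c for c in candidates if is_x(c)]
    if s_selector is not None:
        is_s = as_predicate(s_selector)
        s_cols = [c for c in candidates if is_s(c)]
    # Whichever selector is missing falls back to "everything else".
    if x_selector is None:
        x_cols = [c for c in candidates if c not in s_cols]
    if s_selector is None:
        s_cols = [c for c in candidates if c not in x_cols]
    return x_cols, s_cols


# Hypothetical column names, for illustration only.
columns = ["Time", "NCBI:818", "NCBI:821", "glucose", "acetate"]
x_cols, s_cols = split_columns(
    columns, "Time", x_selector=lambda c: c.startswith("NCBI:")
)
print(x_cols, s_cols)
```

Passing only s_selector=["glucose", "acetate"] to the same helper yields the same split, since the consumer columns are then inferred as the remainder.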

The dataloader will store the cleaned data frame, the time column as a numpy array, the Y values and a ModelNames object with the detected column names.
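
As a rough illustration of how the time column lookup described earlier might behave, the sketch below tries each candidate name in order and returns the first one present. The helper name find_time_col, the first-match-wins order, and the error raised when nothing matches are assumptions, not the library's documented behavior.

```python
# Default candidates as listed in the documentation above.
DEFAULT_TIME_CANDIDATES = ["Time (hours)", "Time", "time", "t"]


def find_time_col(columns, time_col=None):
    """Return the first candidate time column name found in `columns`."""
    if time_col is None:
        candidates = DEFAULT_TIME_CANDIDATES
    elif isinstance(time_col, str):
        candidates = [time_col]
    else:
        candidates = list(time_col)
    for name in candidates:
        if name in columns:
            return name
    # Assumption: a missing time column is treated as an error.
    raise ValueError(f"No time column found among {candidates}")


print(find_time_col(["t", "BT counts", "acetate (mM)"]))
```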

Example snippet:

from mgrowthctrl.utils.data import Dataloader

# Load experimental data, detecting names of consumer and resource columns:
data = Dataloader()
data.load_local_data("examples/datasets/bhbtri_3species.csv", x_selector=r"Cells")

print(data.names)
print(data.df)

If you have a dataset that includes error values, the loader expects that these columns are labeled in a consistent way. For instance, the example file BT_WC_export_average.csv uses the string “STD” to indicate standard deviation columns, located between the name of the measured target and the column units. The value column “BT counts” has an error column “BT counts STD”, while the value column “acetate (mM)” has an error column “acetate STD (mM)”. We can use the following code to correctly detect value columns and error columns:

from mgrowthctrl.utils.data import Dataloader

data = Dataloader()
data.load_local_data(
    "examples/datasets/BT_WC_export_average.csv",

    # Use list of strings to name the single X column directly
    x_selector=['BT counts'],

    # Use a lambda to find the " STD" string anywhere in the column and return
    # the STD-less string as the target value column:
    err_selector=lambda col: col.replace(' STD', '') if ' STD' in col else None,
)

print(data.X_names)
#=> ['BT counts']
print(data.X_err_names)
#=> ['BT counts STD']

print(data.S_names)
#=> ['acetate (mM)', 'formate (mM)', ...]
print(data.S_err_names)
#=> ['acetate STD (mM)', 'formate STD (mM)', ...]
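
To see why a lambda is used here instead of the regex form, the following standalone sketch applies both kinds of err_selector to the column names mentioned above. The helper error_target and its use of re.search are assumptions for illustration; only the two selectors themselves come from this page.

```python
import re


def error_target(col, err_selector):
    """Return the value column that `col` is the error of, or None."""
    if callable(err_selector):
        return err_selector(col)
    # Regex form: the single capture group names the target column.
    match = re.search(err_selector, col)
    return match.group(1) if match else None


columns = ["BT counts", "BT counts STD", "acetate (mM)", "acetate STD (mM)"]

by_lambda = lambda col: col.replace(" STD", "") if " STD" in col else None
by_regex = r"(.*) STD$"

lambda_pairs = {
    c: error_target(c, by_lambda) for c in columns if error_target(c, by_lambda)
}
regex_pairs = {
    c: error_target(c, by_regex) for c in columns if error_target(c, by_regex)
}

print(lambda_pairs)
print(regex_pairs)
```

The regex only pairs up "BT counts STD", because "acetate STD (mM)" does not end in " STD". The lambda handles both, since it removes the marker wherever it appears in the name.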

Loading from mGrowthDB

Instead of loading data from a local file and manually describing which columns are consumers and which are resources, you can fetch data from an mGrowthDB experiment by using the load_remote_experiment method. Because the remote source provides metadata for the experiments, you don’t need to explicitly provide column name selectors.

Example snippet that fetches the data of experiment EMGDB000000020:

from mgrowthctrl.utils.data import Dataloader

# Load experimental data, detecting names of consumer and resource columns:
data = Dataloader()
data.load_remote_experiment("EMGDB000000020")

print(data.names)
print(data.df)

By default, the loader will attempt to download the computed “average” bioreplicate of that experiment. If there is no average bioreplicate, the first one will be downloaded. If you’d like to pick a specific bioreplicate, you can select it by the name you can find on the web application:

# Load experimental data from a specific bioreplicate, selected by name:
data = Dataloader()
data.load_remote_experiment("EMGDB000000020", bioreplicate_name="BT_WC_2")

print(data._remote_bioreplicate_metadata["name"])
#=> BT_WC_2

For debugging purposes, you can access the JSON response from the mGrowthDB API in the _remote_bioreplicate_metadata field. To learn more about the API and/or use it directly, you can consult the mGrowthDB API documentation.
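
The bioreplicate selection rule described in this section (an explicit name if given, otherwise the average, otherwise the first one) could be sketched as follows. This is not the loader's actual code: the function name pick_bioreplicate, the "isAverage" key, and the sample records are hypothetical. Only the "name" key appears in the example above.

```python
def pick_bioreplicate(bioreplicates, prefer_average=True, bioreplicate_name=None):
    """Choose one bioreplicate record from a list of metadata dicts."""
    if not bioreplicates:
        raise ValueError("The experiment has no bioreplicates")
    if bioreplicate_name is not None:
        for rep in bioreplicates:
            if rep["name"] == bioreplicate_name:
                return rep
        raise ValueError(f"No bioreplicate named {bioreplicate_name!r}")
    if prefer_average:
        for rep in bioreplicates:
            # Hypothetical flag marking the computed average replicate.
            if rep.get("isAverage"):
                return rep
    # Fall back to the first bioreplicate.
    return bioreplicates[0]


# Hypothetical metadata records, for illustration only.
reps = [
    {"name": "BT_WC_1", "isAverage": False},
    {"name": "BT_WC_2", "isAverage": False},
    {"name": "Average(BT_WC)", "isAverage": True},
]

print(pick_bioreplicate(reps)["name"])
print(pick_bioreplicate(reps, bioreplicate_name="BT_WC_2")["name"])
print(pick_bioreplicate(reps, prefer_average=False)["name"])
```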

Pushing data to mGrowthDB

If you initialize a dataloader instance with an API key for mGrowthDB, you can push data to a “Workspace” on the application. This workspace is scoped to your user, and it can be private to you (useful for visualization purposes) or it can be made publicly available (useful for sharing data and model predictions with other people). Here is an example of how you can load some local data, fit it (see Consumer-Resource Model), and then push it to the application for visualization purposes:

import numpy as np
from mgrowthctrl.utils.data import Dataloader
from mgrowthctrl.models import CRModel, CRModelParams

# Load experimental data, detecting names of consumer and resource columns:
data = Dataloader(api_key="[your-api-key]")
data.load_local_data("examples/datasets/BT_WC_export.csv", x_selector=r"BT counts")

# Fit data by using least-squares fitting
model = CRModel.from_single_species_data(
    df=data.df,
    time_col=data.time_col,
    x_col=data.X_names[0],  # Note: expects a single species string here
    s_cols=data.S_names,
)
model.fit(
    df=data.df,
    time_col=data.time_col,
    x_cols=data.X_names,
    s_cols=data.S_names,
)

# Simulate 200 time points:
t_sim = np.linspace(data.start_time, data.end_time, num=200)
sim = model.simulate(data.y0, t_sim)

# Push data and get the visualization URL from the response:
response = data.push_remote_data(
    workspace="API test",
    sim=sim,
    x_units='Cells/μL',
    s_units='mM',
)

print(response['workspaceVisualizeUrl'])

All the parameters given to push_remote_data are optional. You can read the details on how to invoke the function in the API Documentation section below, and you can learn more about workspaces and how to use them from the mGrowthDB API documentation on the topic.

API Documentation

mgrowthctrl.utils.data.DEFAULT_ROOT_URL = 'https://mgrowthdb.gbiomed.kuleuven.be': By default, remote data is fetched from the mGrowthDB database.

class mgrowthctrl.utils.data.Dataloader(root_url='https://mgrowthdb.gbiomed.kuleuven.be', api_key=None)

Class that encapsulates data and metadata that is ready for fitting to a CR model.

After instantiating, you need to invoke load_local_data or load_remote_experiment to load some data source into the object’s properties.

Parameters:

root_url – An optional custom URL for remote fetching.
api_key – An optional API key to enable pushing data to the remote source.

Properties:

df: The parsed dataframe
time_col: The name of the time column in the dataframe
names: A ModelNames object that includes X and S lists of consumer and substrate column names, respectively
t: A numpy array of the time values
Y: A numpy array of the measurements

property X_names: A shorthand to fetch the consumer (X) column names of the dataframe

property S_names: A shorthand to fetch the substrate (S) column names of the dataframe

property X_err_names: A shorthand to fetch the consumer (X) error column names of the dataframe

property S_err_names: A shorthand to fetch the substrate (S) error column names of the dataframe

property y0: A shorthand to fetch the initial measurement values, e.g. for prediction purposes

property start_time: float: The first time value

property end_time: float: The final time value

load_local_data(data, ...)

Process a pandas dataframe or a CSV file that will be parsed into a dataframe. The given X and S selectors will be used to determine which columns represent consumers (X) and substrates (S). At least one of the two selectors needs to be provided.

Parameters:

data – Either a pandas DataFrame, or a file path to open
time_col – The name of the time column or a list of names to try
x_selector – A regex, a list of values, or a callable that determines which column represents a consumer.
s_selector – A regex, a list of values, or a callable that determines which column represents a substrate.
err_selector – A regex or a callable that determines which column represents measurement error. If it’s a regex, it needs to have a single capture group that returns the column it refers to, for example: r'(.*) STD$'. If it’s a callable, it needs to return the column it refers to or None.
require_monotonic_time – Whether to check if the values in the time column are monotonic
clip_nonneg – A floor value that the values are clipped to, defaults to 0.0, meaning that negative values are truncated to 0.
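
The effect of clip_nonneg and require_monotonic_time can be pictured with plain Python lists. This is a simplified sketch, not the library's implementation: the helper names are made up, "monotonic" is assumed to mean non-decreasing, and passing None for clip_nonneg is assumed to disable clipping.

```python
def clip_floor(values, clip_nonneg=0.0):
    """Raise every value below the floor up to the floor."""
    if clip_nonneg is None:
        # Assumption: None disables clipping entirely.
        return list(values)
    return [max(v, clip_nonneg) for v in values]


def is_monotonic(times):
    """Check that time values never decrease from one row to the next."""
    return all(a <= b for a, b in zip(times, times[1:]))


measurements = [-0.2, 0.0, 1.5, 3.1]
clipped = clip_floor(measurements)
print(clipped)

print(is_monotonic([0.0, 1.0, 1.0, 2.0]))
print(is_monotonic([0.0, 2.0, 1.0]))
```

Small negative readings, which can come from background subtraction for instance, are lifted to the floor value. A time column that goes backwards is flagged rather than silently accepted.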

load_remote_experiment(experiment_id: str, *, prefer_average: bool = True, bioreplicate_name: str | None = None, time_round: int | None = 3, include_other: bool = False, keep_predicate: Callable[[dict], bool] | None = None, clip_nonneg: float | None = 0.0, require_monotonic_time: bool = True)

Download an experiment’s data from mGrowthDB and detect its consumer and substrate columns. By default, it looks for a bioreplicate in the experiment marked as “average” and otherwise fetches data from the first bioreplicate. If given an explicit bioreplicate name that can be found on mGrowthDB, it selects that replicate instead.

Parameters:

experiment_id – Stable experiment identifier of the form EMGDBxxx
prefer_average – Whether to look for the average bioreplicate or not
bioreplicate_name – If given, will select a specific bioreplicate by its name
time_round – The number of digits to round time values to
include_other – Whether to include measurements from unknown techniques
keep_predicate – A callable that is invoked with the metadata of each measurement context and decides whether to keep it in the dataframe or not
clip_nonneg – A floor value that the values are clipped to; defaults to 0.0, meaning that negative values are truncated to 0
require_monotonic_time – Whether to check if the values in the time column are monotonic
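
A keep_predicate is simply a function that receives one metadata dictionary and returns True or False. The sketch below shows the general shape of such a callable. The keys "subjectName" and "techniqueType" and the sample records are placeholders invented for illustration. Inspect the actual API response to see which fields are available before writing a real predicate.

```python
def keep_subjects(allowed_names):
    """Build a predicate that keeps contexts whose subject is in `allowed_names`."""
    allowed = set(allowed_names)

    def predicate(context):
        # "subjectName" is a hypothetical key, not a documented field.
        return context.get("subjectName") in allowed

    return predicate


# Hypothetical measurement-context metadata, for illustration only.
contexts = [
    {"subjectName": "acetate", "techniqueType": "metabolite"},
    {"subjectName": "formate", "techniqueType": "metabolite"},
    {"subjectName": "BT", "techniqueType": "counts"},
]

predicate = keep_subjects(["acetate", "BT"])
kept = [c["subjectName"] for c in contexts if predicate(c)]
print(kept)
```

A callable built this way could then be passed as keep_predicate=predicate, assuming the real metadata exposes a comparable field.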

push_remote_data(workspace='default', sim=None, x_units=None, s_units=None, append=False)

Push the data stored in the object to a Workspace on mGrowthDB.

This will only work if the object was instantiated with an api_key argument, otherwise it will raise an error. You can provide a “simulation” object coming from simulate calls to also upload model predictions that are matched up to the data.

When passing along units, consult the mGrowthDB documentation for valid unit names. If you send an unsupported unit value, it will be recorded as metadata, but it may not be comparable to data elsewhere on the site.

Parameters:

workspace – The name of the workspace to push to. If it doesn’t exist, it will be created. If it’s not passed along, the data is uploaded to the default workspace.
sim – An object with simulation data.
x_units – Units for growth (consumer) data.
s_units – Units for metabolite (substrate) data.

Returns:

A dictionary with the following keys:

workspaceUrl – The main URL of the workspace where the data was uploaded
workspaceVisualizeUrl – The URL of the visualization page
workspaceEntryIds – A list of workspace entry IDs you can use to fetch the data via the mGrowthDB API