Data Loading
mGrowthCtrl provides the class Dataloader for loading experimental data from CSV files and from the mGrowthDB database. See below for the reference documentation on a dataloader object’s properties and methods.
The loaded data can be used directly with CRModel (see Consumer-Resource Model).
Loading from CSV
To load data from a local file, you can use the load_local_data method of the Dataloader class. You can provide a pandas dataframe you’ve built up however you like, or the path of a CSV file that the method will try to parse with pandas’ default settings.
The method will try to interpret a CSV’s columns, but ideally you should provide explicit “selector” arguments:
- Time (time_col): You can provide a single time column as a string or a list of column name candidates. By default, the function will attempt names from the following list: “Time (hours)”, “Time”, “time”, “t”. The found column will be stored as the time_col attribute.
- X columns (consumers, x_selector): A lambda, a regex, or a list of specific names that determines which column is a consumer, for example lambda c: c.startswith("NCBI:"). If this is omitted, consumer columns will be considered to be the ones that don’t match the s_selector.
- S columns (substrates, s_selector): A lambda, a regex, or a list of specific names that marks substrates. If this is omitted, substrate columns will be considered to be the ones that remain after consumers are found with the x_selector.
- Error columns (err_selector): A lambda or a regex that detects error columns. At this time, they’ll only be used when plotting to show the data’s uncertainty, and won’t be used in modeling. If given as a regex, it needs to have a single capture group that returns the target column it is the error of. For instance, if we have the columns “Bacteroides” and “Bacteroides STD”, we could use the selector r'(.*) STD$'. If the selector is a function, it needs to return the target column or None.
At least one of x_selector or s_selector needs to be provided, or the method will raise an exception. The err_selector is optional and only useful if your dataset has a standard deviation column for each value column, named according to a consistent scheme.
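To make the selector rules above concrete, here is a minimal pure-Python sketch of how a time column could be resolved and the remaining columns split into consumers and substrates. It is not mGrowthCtrl's implementation: the helpers resolve_time_col, to_predicate, and split_columns are hypothetical, the regex is assumed to match anywhere in the column name, and columns matched by neither selector are assumed to be dropped. Error columns are left out for brevity.

```python
import re

def resolve_time_col(columns, candidates=("Time (hours)", "Time", "time", "t")):
    """Return the first candidate name that is present in `columns`."""
    if isinstance(candidates, str):
        candidates = [candidates]
    for name in candidates:
        if name in columns:
            return name
    raise ValueError(f"no time column found among {list(candidates)}")

def to_predicate(selector):
    """Turn a callable, a regex string, or a list of names into a predicate."""
    if callable(selector):
        return selector
    if isinstance(selector, str):
        pattern = re.compile(selector)
        # Assumption: the regex may match anywhere in the column name.
        return lambda col: pattern.search(col) is not None
    names = set(selector)
    return lambda col: col in names

def split_columns(columns, x_selector=None, s_selector=None, time_col=None):
    """Return (time_col, x_columns, s_columns) for a list of column names."""
    if x_selector is None and s_selector is None:
        raise ValueError("provide at least one of x_selector or s_selector")
    if time_col is None:
        found_time = resolve_time_col(columns)
    else:
        found_time = resolve_time_col(columns, time_col)
    rest = [c for c in columns if c != found_time]

    is_x = to_predicate(x_selector) if x_selector is not None else None
    is_s = to_predicate(s_selector) if s_selector is not None else None
    # A missing selector is treated as the complement of the one that was given.
    if is_x is None:
        is_x = lambda col: not is_s(col)
    if is_s is None:
        is_s = lambda col: not is_x(col)

    x_cols = [c for c in rest if is_x(c)]
    s_cols = [c for c in rest if is_s(c) and c not in x_cols]
    return found_time, x_cols, s_cols

columns = ["Time", "NCBI:818 Cells", "NCBI:821 Cells", "glucose (mM)"]
print(split_columns(columns, x_selector=r"Cells"))
```

With only x_selector given, every non-time column that does not match it ends up in the substrate list, which mirrors the fallback described for s_selector.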
The dataloader will store the cleaned data frame, the time column as a numpy array, the Y values and a ModelNames object with the detected column names.
Example snippet:
from mgrowthctrl.utils.data import Dataloader
# Load experimental data, detecting names of consumer and resource columns:
data = Dataloader()
data.load_local_data("examples/datasets/bhbtri_3species.csv", x_selector=r"Cells")
print(data.names)
print(data.df)
If you have a dataset that includes error values, the loader expects that these columns are labeled in a consistent way. For instance, the example file BT_WC_export_average.csv uses the string “STD” to indicate standard deviation columns, located between the name of the measured target and the column units. The value column “BT counts” has an error column “BT counts STD”, while the value column “acetate (mM)” has an error column “acetate STD (mM)”. We can use the following code to correctly detect value columns and error columns:
from mgrowthctrl.utils.data import Dataloader
data = Dataloader()
data.load_local_data(
    "examples/datasets/BT_WC_export_average.csv",
    # Use list of strings to name the single X column directly
    x_selector=['BT counts'],
    # Use a lambda to find the " STD" string anywhere in the column and return
    # the STD-less string as the target value column:
    err_selector=lambda col: col.replace(' STD', '') if ' STD' in col else None,
)
print(data.X_names)
#=> ['BT counts']
print(data.X_err_names)
#=> ['BT counts STD']
print(data.S_names)
#=> ['acetate (mM)', 'formate (mM)', ...]
print(data.S_err_names)
#=> ['acetate STD (mM)', 'formate STD (mM)', ...]
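The reason this example uses a lambda rather than the regex r'(.*) STD$' can be checked on the column names alone, without loading any file. The sketch below compares the two approaches; regex_target is a hypothetical helper that assumes the loader applies the pattern with search semantics and returns the single capture group, which may differ from what the library actually does.

```python
import re

def regex_target(pattern, col):
    """Hypothetical helper: the column captured by the pattern's group, or None."""
    match = re.search(pattern, col)
    return match.group(1) if match else None

# The lambda from the example above.
lambda_selector = lambda col: col.replace(' STD', '') if ' STD' in col else None

columns = ["BT counts", "BT counts STD", "acetate (mM)", "acetate STD (mM)"]

regex_targets = {col: regex_target(r"(.*) STD$", col) for col in columns}
lambda_targets = {col: lambda_selector(col) for col in columns}

for col in columns:
    print(f"{col!r}: regex -> {regex_targets[col]!r}, lambda -> {lambda_targets[col]!r}")
```

The regex only recognizes names that end in " STD", so it misses "acetate STD (mM)", where the units follow the marker. The lambda removes " STD" wherever it appears and therefore maps that column to "acetate (mM)".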
Loading from mGrowthDB
Instead of loading data from a local file and manually describing which columns are consumers and which are resources, you can fetch data from an mGrowthDB experiment by using the load_remote_experiment method. Because the remote source provides metadata for the experiments, you don’t need to explicitly provide column name selectors.
Example snippet that fetches the data of experiment EMGDB000000020:
from mgrowthctrl.utils.data import Dataloader
# Load experimental data, detecting names of consumer and resource columns:
data = Dataloader()
data.load_remote_experiment("EMGDB000000020")
print(data.names)
print(data.df)
By default, the loader will attempt to download the computed “average” bioreplicate of that experiment. If there is no average bioreplicate, the first one will be downloaded. If you’d like to pick a specific bioreplicate, you can select it by the name you can find on the web application:
# Load experimental data, detecting names of consumer and resource columns:
data = Dataloader()
data.load_remote_experiment("EMGDB000000020", bioreplicate_name="BT_WC_2")
print(data._remote_bioreplicate_metadata["name"])
#=> BT_WC_2
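The selection order described above (explicit name first, then the average, then the first bioreplicate) can be sketched in plain Python. This is only an illustration: pick_bioreplicate is a hypothetical function, the "isAverage" key is an assumed way of marking the average bioreplicate, and raising an error for an unknown name is also an assumption. Only the "name" key is taken from the example above.

```python
def pick_bioreplicate(bioreplicates, prefer_average=True, bioreplicate_name=None):
    """Choose one bioreplicate metadata dict from a list of them."""
    if not bioreplicates:
        raise ValueError("experiment has no bioreplicates")

    # 1. An explicit name always wins.
    if bioreplicate_name is not None:
        for rep in bioreplicates:
            if rep["name"] == bioreplicate_name:
                return rep
        raise ValueError(f"no bioreplicate named {bioreplicate_name!r}")

    # 2. Otherwise prefer the computed average, if there is one.
    if prefer_average:
        for rep in bioreplicates:
            if rep.get("isAverage"):  # hypothetical metadata key
                return rep

    # 3. Fall back to the first bioreplicate.
    return bioreplicates[0]

# Made-up metadata for illustration only.
replicates = [
    {"name": "BT_WC_1", "isAverage": False},
    {"name": "BT_WC_2", "isAverage": False},
    {"name": "Average(BT_WC)", "isAverage": True},
]
print(pick_bioreplicate(replicates)["name"])
print(pick_bioreplicate(replicates, bioreplicate_name="BT_WC_2")["name"])
```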
For debugging purposes, you can access the JSON response from the mGrowthDB API in the _remote_bioreplicate_metadata field. To learn more about the API or to use it directly, consult the mGrowthDB API documentation.
API Documentation
- mgrowthctrl.utils.data.DEFAULT_ROOT_URL = 'https://mgrowthdb.gbiomed.kuleuven.be'
By default, remote data is fetched from the mGrowthDB database
- class mgrowthctrl.utils.data.Dataloader(root_url='https://mgrowthdb.gbiomed.kuleuven.be', api_key=None)
Class that encapsulates data and metadata that is ready for fitting to a CR model. After instantiating, you need to invoke load_local_data or load_remote_experiment to load some data source into the object’s properties.
Properties:
- df:
The parsed dataframe
- time_col:
The name of the time column in the dataframe
- names:
A ModelNames object that includes X and S lists of consumer and substrate column names, respectively
- t:
A numpy array of the time values
- Y:
A numpy array of the measurements
- property X_names
A shorthand to fetch the consumer (X) column names of the dataframe
- property S_names
A shorthand to fetch the substrate (S) column names of the dataframe
- property X_err_names
A shorthand to fetch the consumer (X) error column names of the dataframe
- property S_err_names
A shorthand to fetch the substrate (S) error column names of the dataframe
- property y0
A shorthand to fetch the initial measurement values, e.g. for prediction purposes
- property start_time: float
The first time value
- property end_time: float
The final time value
- load_local_data(
    data: DataFrame | str | Path,
    *,
    time_col: str | Iterable[str] = ('Time (hours)', 'Time', 'time', 't'),
    x_selector: str | Iterable[str] | Callable[[str], bool] | None = None,
    s_selector: str | Iterable[str] | Callable[[str], bool] | None = None,
    err_selector: str | Callable[[str], str | None] | None = None,
    require_monotonic_time: bool = True,
    clip_nonneg: float | None = 0.0,
  )
Process a pandas dataframe or a CSV file that will be parsed into a dataframe. The given X and S selectors will be used to determine which columns represent consumers (X) and substrates (S). At least one of the two selectors needs to be provided.
- Parameters:
data – Either a pandas DataFrame, or a file path to open
time_col – The name of the time column or a list of names to try
x_selector – A regex, a list of values, or a callable that determines which column represents a consumer.
s_selector – A regex, a list of values, or a callable that determines which column represents a substrate.
err_selector – A regex or a callable that determines which column represents measurement error. If it’s a regex, it needs to have a single capture group that returns the column it refers to, for example: r'(.*) STD$'. If it’s a callable, it needs to return the column it refers to or None
require_monotonic_time – Whether to check if the values in the time column are monotonic
clip_nonneg – A floor value that the values are clipped to, defaults to 0.0, meaning that negative values are truncated to 0.
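As a rough illustration of the last two parameters, the sketch below shows what a monotonic-time check and a floor clip amount to on plain lists. The function names are hypothetical, and two details are assumptions rather than documented behavior: "monotonic" is taken to mean non-decreasing, and clip_nonneg=None is taken to mean that no clipping happens.

```python
def is_monotonic(times):
    """True if each time value is >= the previous one (assumed meaning)."""
    return all(earlier <= later for earlier, later in zip(times, times[1:]))

def clip_floor(values, floor=0.0):
    """Raise every value below `floor` up to `floor`; None disables clipping."""
    if floor is None:
        return list(values)
    return [max(value, floor) for value in values]

times = [0.0, 1.0, 2.0, 4.0]
measurements = [0.5, -0.02, 1.3, -0.4]

print(is_monotonic(times))
print(clip_floor(measurements))
```

Small negative readings typically come from background subtraction or measurement noise, which is presumably why the default floor is 0.0.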
- load_remote_experiment(
    experiment_id: str,
    *,
    prefer_average: bool = True,
    bioreplicate_name: str | None = None,
    time_round: int | None = 3,
    include_other: bool = False,
    keep_predicate: Callable[[dict], bool] | None = None,
    clip_nonneg: float | None = 0.0,
    require_monotonic_time: bool = True,
  )
Download an experiment’s data from mGrowthDB and detect its consumer and substrate columns. By default, looks for a bioreplicate in the experiment marked as “average”, otherwise fetches data from the first bioreplicate. If given an explicit bioreplicate name that can be found on mGrowthDB, will select that replicate instead.
- Parameters:
experiment_id – Stable experiment identifier of the form EMGDBxxx
prefer_average – Whether to look for the average bioreplicate or not
bioreplicate_name – If given, will select a specific bioreplicate by its name
time_round – The number of digits to round time values to
include_other – Whether to include measurements from unknown techniques
keep_predicate – A callable that is invoked with the metadata of each measurement context and decides whether to keep it in the dataframe or not
clip_nonneg – A floor value that the values are clipped to, defaults to 0.0, meaning that negative values are truncated to 0
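A keep_predicate is an ordinary function from a metadata dict to a boolean, so it can be written and tried out on sample dicts before it is passed to the loader. The sketch below is illustrative only: the keys "techniqueType" and "subjectName" are hypothetical stand-ins, and the real structure of a measurement context should be checked against the mGrowthDB API documentation. The time_round behavior is shown with Python's built-in round, which is an assumption about how the rounding is done.

```python
def keep_without_od(context: dict) -> bool:
    """Keep every measurement context except optical density ones."""
    # "techniqueType" is a hypothetical key used for illustration.
    return context.get("techniqueType") != "od"

# Made-up measurement contexts for illustration only.
contexts = [
    {"techniqueType": "fc", "subjectName": "Bacteroides thetaiotaomicron"},
    {"techniqueType": "od", "subjectName": "Community"},
    {"techniqueType": "metabolite", "subjectName": "acetate"},
]
kept = [context for context in contexts if keep_without_od(context)]
print([context["subjectName"] for context in kept])

# Rounding time values to three digits, as time_round=3 suggests.
rounded_times = [round(t, 3) for t in [0.0, 0.33333, 1.0004]]
print(rounded_times)

# The predicate would then be passed to the loader like this:
# data.load_remote_experiment("EMGDB000000020", keep_predicate=keep_without_od)
```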