************
Data Loading
************

`mGrowthCtrl` provides the class :class:`~mgrowthctrl.utils.data.Dataloader` for loading experimental data from CSV files and from the `mGrowthDB`_ database. See below for the reference documentation on a dataloader object's properties and methods.

The loaded data can be used directly with :class:`~mgrowthctrl.models.crm.model.CRModel` (see :doc:`../models/consumer_resource_model`).

.. _mGrowthDB: https://mgrowthdb.gbiomed.kuleuven.be/

Loading from CSV
================

To load data from a local file, use the ``load_local_data`` method of the :class:`~mgrowthctrl.utils.data.Dataloader` class. You can pass either a pandas dataframe that you have built yourself, or the path to a CSV file, which the method will try to parse with pandas' default settings.

The method will try to interpret the columns of the data, but ideally you should describe them with the following "selector" arguments:

- Time (``time_col``): You can provide a single time column name as a string, or a list of candidate column names. By default, the method tries the names from the following list: "Time (hours)", "Time", "time", "t". The column that is found will be stored as the ``time_col`` attribute.
- X columns (consumers, ``x_selector``): A lambda, a regex, or a list of specific names that determines which columns are consumers, for example ``lambda c: c.startswith("NCBI:")``. If this is omitted, the consumer columns are taken to be the ones that don't match the ``s_selector``.
- S columns (substrates, ``s_selector``): A lambda or a regex that marks substrates. If this is omitted, substrate columns will be considered to be the ones that remain after consumers are found with the ``x_selector``.
- Error columns (``err_selector``): A lambda or a regex that detects error columns. At this time, they are only used when plotting, to show the data's uncertainty, and are not used in modeling. If given as a regex, it needs to have a single capture group that captures the name of the value column the error belongs to. For instance, if we have the columns "Bacteroides" and "Bacteroides STD", we could use the selector ``r'(.*) STD$'``. If the selector is a function, it needs to return the name of the target column, or ``None`` if the column is not an error column.

At least one of ``x_selector`` or ``s_selector`` needs to be provided, otherwise the method will raise an exception. The ``err_selector`` is optional and only useful if your dataset has a standard deviation column for each value column, named according to a consistent scheme.
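
The column rules above can be illustrated with a small standalone sketch. It does not use or reproduce the loader's actual implementation: the helper names ``matches`` and ``partition_columns``, the example column names, and the choice to apply regex selectors with ``re.search`` are all assumptions made for illustration, and error columns are left out for brevity.

.. code-block:: python

    import re

    # Default time column candidates, as listed above.
    TIME_CANDIDATES = ["Time (hours)", "Time", "time", "t"]


    def matches(selector, column):
        """Apply a selector given as a callable, a regex string, or a list of names."""
        if callable(selector):
            return bool(selector(column))
        if isinstance(selector, str):
            # Assumption: regex selectors may match anywhere in the column name.
            return re.search(selector, column) is not None
        return column in selector


    def partition_columns(columns, x_selector=None, s_selector=None):
        """Hypothetical helper: split column names into (time, consumers, substrates)."""
        if x_selector is None and s_selector is None:
            raise ValueError("Provide at least one of x_selector or s_selector")

        time_col = next((c for c in TIME_CANDIDATES if c in columns), None)
        rest = [c for c in columns if c != time_col]

        # A missing selector is filled in as the complement of the other one.
        if x_selector is not None:
            x_cols = [c for c in rest if matches(x_selector, c)]
        else:
            x_cols = [c for c in rest if not matches(s_selector, c)]

        if s_selector is not None:
            s_cols = [c for c in rest if matches(s_selector, c)]
        else:
            s_cols = [c for c in rest if c not in x_cols]

        return time_col, x_cols, s_cols


    # Made-up column names, for illustration only.
    columns = ["Time (hours)", "NCBI:818", "NCBI:853", "glucose (mM)", "acetate (mM)"]
    time_col, x_cols, s_cols = partition_columns(
        columns, x_selector=lambda c: c.startswith("NCBI:")
    )
    print(time_col, x_cols, s_cols)

Passing only ``s_selector=r"\(mM\)$"`` to this sketch yields the same split, since the consumers are then the columns that the substrate selector does not match.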

The dataloader will store the cleaned dataframe, the time column as a NumPy array, the Y values and a :class:`~mgrowthctrl.models.base.ModelNames` object with the detected column names.

Example snippet:

.. code-block:: python

    from mgrowthctrl.utils.data import Dataloader

    # Load experimental data, detecting names of consumer and resource columns:
    data = Dataloader()
    data.load_local_data("examples/datasets/bhbtri_3species.csv", x_selector=r"Cells")

    print(data.names)
    print(data.df)

If you have a dataset that includes error values, the loader expects that these columns are labeled in a consistent way. For instance, the example file `BT_WC_export_average.csv <https://gitlab.com/ComputationalScience/mgrowthctrl/-/raw/main/examples/datasets/BT_WC_export_average.csv>`_ uses the string "STD" to indicate standard deviation columns, located between the name of the measured target and the column units. The value column "BT counts" has an error column "BT counts STD", while the value column "acetate (mM)" has an error column "acetate STD (mM)". We can use the following code to correctly detect value columns and error columns:

.. code-block:: python

    from mgrowthctrl.utils.data import Dataloader

    data = Dataloader()
    data.load_local_data(
        "examples/datasets/BT_WC_export_average.csv",

        # Use list of strings to name the single X column directly
        x_selector=['BT counts'],

        # Use a lambda to find the " STD" string anywhere in the column and return
        # the STD-less string as the target value column:
        err_selector=lambda col: col.replace(' STD', '') if ' STD' in col else None,
    )

    print(data.X_names)
    #=> ['BT counts']
    print(data.X_err_names)
    #=> ['BT counts STD']

    print(data.S_names)
    #=> ['acetate (mM)', 'formate (mM)', ...]
    print(data.S_err_names)
    #=> ['acetate STD (mM)', 'formate STD (mM)', ...]
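
One plausible reason to use a function here instead of the suffix regex ``r'(.*) STD$'`` is that "STD" is not always at the end of the column name in this file. The standalone sketch below compares the two kinds of selector on a few of the column names mentioned above. The helper ``error_target`` is hypothetical, and it assumes a regex selector is applied with ``re.search`` and its first capture group is taken as the target; the loader's real behavior may differ.

.. code-block:: python

    import re


    def error_target(selector, column):
        """Hypothetical helper: name of the value column that `column` holds errors for."""
        if callable(selector):
            return selector(column)
        match = re.search(selector, column)
        return match.group(1) if match else None


    def strip_std(col):
        """Same logic as the lambda in the example above."""
        return col.replace(" STD", "") if " STD" in col else None


    suffix_regex = r"(.*) STD$"
    columns = ["BT counts", "BT counts STD", "acetate (mM)", "acetate STD (mM)"]

    regex_targets = {col: error_target(suffix_regex, col) for col in columns}
    function_targets = {col: error_target(strip_std, col) for col in columns}

    for col in columns:
        print(f"{col!r}: regex -> {regex_targets[col]!r}, function -> {function_targets[col]!r}")

Under these assumptions, the regex only recognizes "BT counts STD", while the function also maps "acetate STD (mM)" to "acetate (mM)".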

Loading from mGrowthDB
======================

Instead of loading data from a local file and manually describing which columns are consumers and which are resources, you can fetch data from an `mGrowthDB`_ experiment by using the ``load_remote_experiment`` method. Because the remote source provides metadata for the experiments, you don't need to explicitly provide column name selectors.

Example snippet that fetches the data of experiment `EMGDB000000020 <https://mgrowthdb.gbiomed.kuleuven.be/experiment/EMGDB000000020/>`_:

.. code-block:: python

    from mgrowthctrl.utils.data import Dataloader

    # Fetch experimental data; column roles come from the experiment's metadata:
    data = Dataloader()
    data.load_remote_experiment("EMGDB000000020")

    print(data.names)
    print(data.df)

By default, the loader attempts to download the computed "average" bioreplicate of the experiment. If there is no average bioreplicate, it downloads the first one. To pick a specific bioreplicate, pass the name shown in the web application as ``bioreplicate_name``:

.. code-block:: python

    from mgrowthctrl.utils.data import Dataloader

    # Load a specific bioreplicate of the experiment, selected by name:
    data = Dataloader()
    data.load_remote_experiment("EMGDB000000020", bioreplicate_name="BT_WC_2")

    print(data._remote_bioreplicate_metadata["name"])
    #=> BT_WC_2

For debugging purposes, you can access the JSON response from the mGrowthDB API in the ``_remote_bioreplicate_metadata`` attribute. To learn more about the API or to use it directly, consult the `mGrowthDB API documentation <https://mgrowthdb.readthedocs.io/en/latest/api.html>`_.
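
The bioreplicate selection rule described above (a requested name first, then the average, then the first entry) can be sketched as a plain function. This is an illustration only and does not reflect the loader's code or the API's response format: the helper ``pick_bioreplicate``, the ``is_average`` flag, the record names other than "BT_WC_2", and the error raised for an unknown name are all assumptions.

.. code-block:: python

    def pick_bioreplicate(bioreplicates, requested_name=None):
        """Hypothetical helper mirroring the selection rule described above."""
        if requested_name is not None:
            for rep in bioreplicates:
                if rep["name"] == requested_name:
                    return rep
            # Assumed behavior: an unknown name is treated as an error.
            raise ValueError(f"No bioreplicate named {requested_name!r}")

        # Prefer the computed average, otherwise fall back to the first entry.
        for rep in bioreplicates:
            if rep.get("is_average"):
                return rep
        return bioreplicates[0]


    # Illustrative records; "is_average" is an assumed flag, not a documented field.
    replicates = [
        {"name": "BT_WC_1", "is_average": False},
        {"name": "BT_WC_2", "is_average": False},
        {"name": "BT_WC_average", "is_average": True},
    ]

    default_choice = pick_bioreplicate(replicates)
    explicit_choice = pick_bioreplicate(replicates, requested_name="BT_WC_2")
    no_average_choice = pick_bioreplicate(replicates[:2])
    print(default_choice["name"], explicit_choice["name"], no_average_choice["name"])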

API Documentation
=================

.. automodule:: mgrowthctrl.utils.data
    :members:
    :member-order: bysource