39 changes: 39 additions & 0 deletions docs/management/external.rst
Original file line number Diff line number Diff line change
@@ -0,0 +1,39 @@
.. _external:

Non-Gadi use
============

ACCESS-NRI Intake Catalog is designed for managing a collection of Intake-ESM datastores on the
`NCI Gadi <https://nci.org.au/our-systems/hpc-systems>`_ environment.
However, it can be modified to manage Intake-ESM datastores stored on other systems.

.. warning::
Adapting ACCESS-NRI Intake Catalog to run on another computer system is an advanced task that requires code
modification, the ability to install this code on your target machine, and an intimate knowledge of the
file system on your target system. It is not recommended for the faint-hearted!

To adapt ACCESS-NRI Intake Catalog to run on a non-Gadi system, check out a copy of the source code and consider
making the following modifications:

1. The constants in the top level :code:`__init__.py` file describe where the input data are stored, the patterns
and regular expressions used to match related file paths, and the location of the final catalog file.
These will need to be updated to reflect the target file system structure.

a. Gadi stores data in a file system that is arrayed by projects. Projects are denoted by an alphanumeric code of
one to two letters, followed by one to two digits, e.g., :code:`hh5`, :code:`io10`, etc. Data is then stored
at targets like :code:`/g/data/<project code>/`.

The :code:`catalog-build` command, which invokes the function :code:`cli.build`, has a sequence of calls to determine
the projects that are involved in a catalog build, and checks that the build user has access to those project
storage locations (this code section currently starts at line 473 of :code:`cli.py`).
The build will be aborted if these checks fail. Therefore, if your storage does not
use a similar directory structure to Gadi (that is, a group of directories all situated at one root location), you may need
to modify or remove these calls to achieve a successful catalog build.

2. The existing YAML files in :code:`config/` refer to the datastores/raw data stored on Gadi. You will need to
remove these, and replace them with similarly-structured YAML files denoting your own data setup. (Note that the
contents of :code:`config/metadata-sources` are archival copies of live experiment :ref:`metadata`;
you will not need to replace these on your system.)

3. The command-line scripts in :code:`bin/` contain PBS commands and file paths specific to Gadi. You will need
to modify these scripts to reflect your computing system.

Review comment (Collaborator):

We should then be able to just update all this to reflect that the defaults can be overridden with environment variables.
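
The project-access check described in step 1a can be illustrated with a short sketch. This is not the code from :code:`cli.py`; the helper name `missing_projects` and the temporary directory standing in for the storage root are assumptions made for illustration. It shows why a storage layout without one directory per project under a single root would cause the check to report missing projects.

```python
from __future__ import annotations

import tempfile
from pathlib import Path


def missing_projects(projects: set[str], storage_root: str) -> list[str]:
    """Return the sorted project codes that have no directory under storage_root.

    Hypothetical helper: it mirrors the idea of the access check, not its exact code.
    """
    root = Path(storage_root)
    return [proj for proj in sorted(projects) if not (root / proj).exists()]


# A temporary directory stands in for the storage root (an assumption for the demo).
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "hh5").mkdir()  # only project "hh5" is "accessible"
    missing = missing_projects({"hh5", "io10"}, tmp)
```

On a system that follows the Gadi layout, an empty result would mean the build can proceed; any entries in the list correspond to the projects that would cause the build to abort.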
1 change: 1 addition & 0 deletions docs/management/index.rst
@@ -18,3 +18,4 @@ of access-nri-intake and how they can be used to build and manage a catalog like
building
schema
release
external
12 changes: 12 additions & 0 deletions src/access_nri_intake/__init__.py
@@ -8,4 +8,16 @@
__version__ = _version.get_versions()["version"]

CATALOG_LOCATION = "/g/data/xp65/public/apps/access-nri-intake-catalog/catalog.yaml"
Review comment (Collaborator):

There's a trick we use for managing config files, common in Django and web development, that might be useful here:

import os

CATALOG_LOCATION = os.environ.get("CATALOG_LOCATION", "/g/data/xp65/public/apps/access-nri-intake-catalog/catalog.yaml")

It lets us override the default with an environment variable if one is set, falling back to the hardcoded default otherwise. It strikes me that it might be useful here, although we'd probably want to prefix all the environment variable names as a namespacing strategy, e.g.:

CATALOG_LOCATION = os.environ.get(
    "ACCESS_NRI_INTAKE_CATALOG_LOCATION", "/g/data/xp65/public/apps/access-nri-intake-catalog/catalog.yaml"
)
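
The override pattern suggested above can be sketched as a small helper. This is a minimal illustration rather than code from the PR: the helper name `env_or_default` and the `ENV_PREFIX` constant are assumptions, and `/scratch/projects` is a made-up path.

```python
import os

ENV_PREFIX = "ACCESS_NRI_INTAKE_"  # assumed namespace prefix, as proposed in the comment


def env_or_default(name: str, default: str) -> str:
    """Return the prefixed environment variable's value, or the default if it is unset."""
    return os.environ.get(f"{ENV_PREFIX}{name}", default)


# With no override set, the hardcoded Gadi default is used.
os.environ.pop("ACCESS_NRI_INTAKE_STORAGE_ROOT", None)
default_root = env_or_default("STORAGE_ROOT", "/g/data")

# With an override set, the environment value wins.
os.environ["ACCESS_NRI_INTAKE_STORAGE_ROOT"] = "/scratch/projects"  # hypothetical path
overridden_root = env_or_default("STORAGE_ROOT", "/g/data")
```

Because module-level constants are evaluated when the module is first imported, an override of this kind only takes effect if the environment variable is set before the package is imported.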

"""Location for 'live' master catalog YAML."""

USER_CATALOG_LOCATION = str(Path.home() / ".access_nri_intake_catalog/catalog.yaml")
Review comment (Collaborator):

Suggested change
USER_CATALOG_LOCATION = str(Path.home() / ".access_nri_intake_catalog/catalog.yaml")
USER_CATALOG_LOCATION = str(Path.home() / ".access_nri_intake_catalog/catalog.yaml")
USER_CATALOG_LOCATION = os.environ.get(
"ACCESS_NRI_INTAKE_USER_CAT_LOCATION", str(Path.home() / ".access_nri_intake_catalog/catalog.yaml")
)

"""Location where user can place a master catalog YAML to override standard 'live' version."""

STORAGE_ROOT = "/g/data"
Review comment (Collaborator):

Suggested change
STORAGE_ROOT = "/g/data"
STORAGE_ROOT = os.environ.get(
"ACCESS_NRI_INTAKE_STORAGE_ROOT", "/g/data"
)

"""Root storage location for catalog experiments"""

STORAGE_FLAG_PATTERN = r"gdata/[a-z]{1,2}[0-9]{1,2}"
Review comment (Collaborator):

Suggested change
STORAGE_FLAG_PATTERN = r"gdata/[a-z]{1,2}[0-9]{1,2}"
STORAGE_FLAG_PATTERN = os.environ.get(
"ACCESS_NRI_INTAKE_STORAGE_FLAG_PATTERN", r"gdata/[a-z]{1,2}[0-9]{1,2}"
)

"""Pattern for matching 'storage flags' - related to Gadi file access system"""

STORAGE_LOCATION_REGEX = r"^/g/data/(?P<proj>[a-z]{1,2}[0-9]{1,2})/.*?$"
Review comment (Collaborator):

Suggested change
STORAGE_LOCATION_REGEX = r"^/g/data/(?P<proj>[a-z]{1,2}[0-9]{1,2})/.*?$"
STORAGE_LOCATION_REGEX = os.environ.get(
"ACCESS_NRI_INTAKE_STORAGE_LOCATION_REGEX", r"^/g/data/(?P<proj>[a-z]{1,2}[0-9]{1,2})/.*?$"
)

"""Regular expression for matching the file path to experiments, and extracting a project ID"""
13 changes: 7 additions & 6 deletions src/access_nri_intake/cli.py
@@ -18,6 +18,7 @@
import yaml
from intake import open_esm_datastore

from . import STORAGE_FLAG_PATTERN, STORAGE_LOCATION_REGEX, STORAGE_ROOT
from .catalog import EXP_JSONSCHEMA, translators
from .catalog.manager import CatalogManager
from .data import CATALOG_NAME_FORMAT
@@ -27,8 +28,6 @@
from .source import builders
from .utils import _can_be_array, get_catalog_fp, load_metadata_yaml

STORAGE_FLAG_PATTERN = "gdata/[a-z]{1,2}[0-9]{1,2}"


class MetadataCheckError(Exception):
pass
@@ -182,7 +181,7 @@ def _parse_build_directory(


def _get_project_code(path: str | Path):
match = re.match(r"/g/data/([^/]*)/.*", str(path))
match = re.match(STORAGE_LOCATION_REGEX, str(path))
return match.groups()[0] if match else None
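
Since `_get_project_code` takes the first capture group of whatever pattern it is given, a site-specific regular expression needs to keep a capture group for the project ID. The following sketch shows one way to sanity-check a replacement pattern. The helper names `check_location_regex` and `project_code`, and the `/data/projects/...` layout, are hypothetical and are not part of the PR.

```python
import re

# Default pattern from the diff above.
GADI_REGEX = r"^/g/data/(?P<proj>[a-z]{1,2}[0-9]{1,2})/.*?$"


def check_location_regex(pattern: str) -> re.Pattern:
    """Compile the pattern and confirm it has at least one capture group."""
    compiled = re.compile(pattern)  # raises re.error if the pattern is invalid
    if compiled.groups < 1:
        raise ValueError("pattern must contain a capture group for the project ID")
    return compiled


def project_code(pattern: re.Pattern, path: str):
    """Return the first captured group for a matching path, else None."""
    match = pattern.match(path)
    return match.groups()[0] if match else None


gadi = check_location_regex(GADI_REGEX)
code_gadi = project_code(gadi, "/g/data/hh5/some/file.nc")
code_unmatched = project_code(gadi, "/scratch/hh5/file.nc")

# Hypothetical layout for another system: /data/projects/<project>/...
site = check_location_regex(r"^/data/projects/(?P<proj>[^/]+)/.*$")
code_site = project_code(site, "/data/projects/ocean01/run1/out.nc")
```

A check like this could run once at start-up, so that a malformed override fails early instead of silently yielding no project codes.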


@@ -203,14 +202,14 @@ def _get_project(paths: list[str], method: str | None = None) -> set[str]:

def _confirm_project_access(projects: set[str]) -> tuple[bool, str]:
"""
Return False and the missing project if the user can't access all necessary projects' /g/data spaces.
Return False and the missing project if the user can't access all necessary projects' `access_nri_intake.STORAGE_ROOT` spaces.

Returns:
tuple[bool, str]: Whether the user can access all projects, and a string of any missing projects
"""
missing_projects = []
for proj in sorted(projects):
p = Path("/g/data") / proj
p = Path(STORAGE_ROOT) / proj
if not p.exists():
missing_projects.append(proj)

@@ -515,7 +514,9 @@ def build(argv: Sequence[str] | None = None):
else:
warnings.warn(f"Unable to determine project for base path {build_base_path}")

storage_flags = "+".join(sorted([f"gdata/{proj}" for proj in project if proj]))
storage_flags = "+".join(
sorted([f"{STORAGE_ROOT.replace('/', '')}/{proj}" for proj in project if proj])
)

_valid_permissions, _err_msg = _confirm_project_access(project)
if not _valid_permissions:
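
To see what the new `storage_flags` expression produces for different roots, the sketch below reproduces it in a standalone function. The function name `build_storage_flags` and the `/data/projects` root are assumptions for illustration; the default `STORAGE_FLAG_PATTERN` is copied from the diff.

```python
import re

# Default pattern from the diff; it describes Gadi-style storage flags.
STORAGE_FLAG_PATTERN = r"gdata/[a-z]{1,2}[0-9]{1,2}"


def build_storage_flags(storage_root: str, projects: set) -> str:
    """Mirror the expression in build(): strip slashes from the root, then join per-project flags."""
    prefix = storage_root.replace("/", "")
    return "+".join(sorted(f"{prefix}/{proj}" for proj in projects if proj))


# Empty project codes are skipped, as in the original list comprehension.
gadi_flags = build_storage_flags("/g/data", {"xp65", "hh5", ""})
other_flags = build_storage_flags("/data/projects", {"ocean01"})  # hypothetical site layout

# Check each generated flag against the default pattern.
gadi_ok = all(re.fullmatch(STORAGE_FLAG_PATTERN, f) for f in gadi_flags.split("+"))
other_ok = all(re.fullmatch(STORAGE_FLAG_PATTERN, f) for f in other_flags.split("+"))
```

With the Gadi root the flags match the default pattern, whereas the hypothetical root yields `dataprojects/ocean01`, which does not. This suggests that `STORAGE_ROOT` and `STORAGE_FLAG_PATTERN` would likely need to be overridden together on another system.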