Provided by: python3-intake_0.6.5-1_amd64

NAME

       intake - Intake Documentation

       Taking the pain out of data access and distribution

       Intake  is  a  lightweight  package  for  finding, investigating, loading and disseminating data. It will
       appeal to different groups for some of the reasons below, but is useful for all  and  acts  as  a  common
       platform that everyone can use to smooth the progression of data from developers and providers to users.

       Intake  contains  the following main components. You do not need to use them all! The library is modular,
       only use the parts you need:

       • A set of data loaders (Drivers) with a common interface, so that you can investigate or load anything,
         local or remote, with the exact same call, turning it into data structures that you already know how
         to manipulate, such as arrays and data-frames.

       • A Cataloging system (Catalogs) for listing data sources, their metadata and parameters, and
         referencing which of the Drivers should load each. The catalogs form a hierarchical, searchable
         structure, which can be backed by files, Intake servers or third-party data services.

       • Sets of convenience functions to apply to various data sources, such as data-set persistence, automatic
         concatenation and metadata inference and the ability to distribute  catalogs  and  data  sources  using
         simple packaging abstractions.

       • A GUI layer accessible in the Jupyter notebook or as a standalone webserver, which allows you to find
         and navigate catalogs, investigate data sources, and either plot predefined visualisations or
         interactively find the right view yourself.

       • A  client-server  protocol to allow for arbitrary data cataloging services or to serve the data itself,
         with a pluggable auth model.

DATA USER

       • Intake loads the data for a range of formats and  types  (see  plugin-directory)  into  containers  you
         already use, like Pandas dataframes, Python lists, NumPy arrays, and more

       • Intake loads, then gets out of your way

       • Search and introspect data-sets in Catalogs via the GUI: quickly find what you need to do your work

       • Install data-sets and automatically get requirements

       • Leverage cloud resources and distributed computing.

       See the executable tutorial:

       https://mybinder.org/v2/gh/intake/intake-examples/master?filepath=tutorial%2Fdata_scientist.ipynb

DATA PROVIDER

       • Simple spec to define data sources

       • Single point of truth, no more copy&paste

       • Distribute data using packages, shared files or a server

       • Update definitions in-place

       • Parametrise user options

       • Make use of additional functionality like filename parsing and caching.

       See the executable tutorial:

       https://mybinder.org/v2/gh/intake/intake-examples/master?filepath=tutorial%2Fdata_engineer.ipynb

IT

       • Create catalogs out of established departmental practices

       • Provide data access credentials via Intake parameters

       • Use server-client architecture as gatekeeper:

            • add authentication methods

            • add a monitoring point; track the data-sets being accessed.

       • Hook Intake into proprietary data access systems.

DEVELOPER

       • Turn boilerplate code into a reusable Driver

       • Pluggable architecture of Intake allows for many points to add and improve

       • Open, simple code-base -- come and get involved on github!

       See the executable tutorial:

       https://mybinder.org/v2/gh/intake/intake-examples/master?filepath=tutorial%2Fdev.ipynb

       The start document contains the sections that all users new to Intake should read through. usecases shows
       specific  problems  that  Intake solves.  For a brief demonstration, which you can execute locally, go to
       quickstart.  For a general description of all of the components of Intake and how they fit  together,  go
       to  overview.  Finally,  for  some  notebooks  using Intake and articles about Intake, go to examples and
       intake-examples.  These and other documentation pages will make reference to concepts that are defined in
       the glossary.

START HERE

       These documents will familiarise you with Intake, show you some basic usage and  examples,  and  describe
       Intake's place in the wider python data world.

   Quickstart
       This  guide  will  show  you  how to get started using Intake to read data, and give you a flavour of how
       Intake feels to the Data User.  It assumes you  are  working  in  either  a  conda  or  a  virtualenv/pip
       environment.  For  notebooks  with executable code, see the examples. This walk-through can be run from a
       notebook or interactive python session.

   Installation
       If you are using Anaconda or Miniconda, install Intake with the following commands:

          conda install -c conda-forge intake

       If you are using virtualenv/pip, run the following command:

          pip install intake

       Note that this will install only the minimal set of optional requirements. If you want a more complete
       install, use intake[complete] instead.
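
       For example, with pip (quoting may be needed in some shells, such as zsh):

          pip install "intake[complete]"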

   Creating Sample Data
       Let's begin by creating a sample data set and catalog.  At the command line, run the intake example
       command.  This will create an example data Catalog and two CSV data files.  These files contain some
       basic facts about the 50 US states, and the catalog includes a specification of how to load them.

   Loading a Data Source
       Data  sources  can  be  created  directly  with the open_*() functions in the intake module.  To read our
       example data:

          >>> import intake
          >>> ds = intake.open_csv('states_*.csv')
          >>> print(ds)
          <intake.source.csv.CSVSource object at 0x1163882e8>

       Each open function has different arguments, specific to the data format or service being used.

   Reading Data
       Intake reads data into memory using containers you are already familiar with:

          • Tables: Pandas DataFrames

          • Multidimensional arrays: NumPy arrays

          • Semistructured data: Python lists of objects (usually dictionaries)

       To find out what kind of container a data source will produce, inspect the container attribute:

          >>> ds.container
          'dataframe'

       The result will be dataframe, ndarray, or python.  (New container types will be added in the future.)

       For data that fits in memory, you can ask Intake to load it directly:

          >>> df = ds.read()
          >>> df.head()
                  state        slug code                nickname  ...
          0     Alabama     alabama   AL      Yellowhammer State
          1      Alaska      alaska   AK       The Last Frontier
          2     Arizona     arizona   AZ  The Grand Canyon State
          3    Arkansas    arkansas   AR       The Natural State
          4  California  california   CA            Golden State

       Many data sources will also have quick-look plotting available. The attribute .plot will list a number
       of built-in plotting methods, such as .scatter(); see plotting.
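
       For example, with the optional hvplot package installed, a quick-look plot might be produced like this
       (a minimal sketch; the column names are taken from the example states data):

          >>> ds.plot.scatter(x='population', y='population_rank')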

       Intake data sources can have partitions.  A partition refers to a contiguous chunk of data  that  can  be
       loaded  independent of any other partition.  The partitioning scheme is entirely up to the plugin author.
       In the case of the CSV plugin, each .csv file is a partition.

       To read data from a data source one chunk at a time, the read_chunked() method returns an iterator:

          >>> for chunk in ds.read_chunked(): print('Chunk: %d' % len(chunk))
          ...
          Chunk: 24
          Chunk: 26

   Working with Dask
       Working with large datasets is much easier with a parallel,  out-of-core  computing  library  like  Dask.
       Intake  can create Dask containers (like dask.dataframe) from data sources that will load their data only
       when required:

          >>> ddf = ds.to_dask()
          >>> ddf
          Dask DataFrame Structure:
                      admission_date admission_number capital_city capital_url    code constitution_url facebook_url landscape_background_url map_image_url nickname population population_rank skyline_background_url    slug   state state_flag_url state_seal_url twitter_url website
          npartitions=2
                              object            int64       object      object  object           object       object                   object        object   object      int64           int64                 object  object  object         object         object      object  object
                                  ...              ...          ...         ...     ...              ...          ...                      ...           ...      ...        ...             ...                    ...     ...     ...            ...            ...         ...     ...
                                  ...              ...          ...         ...     ...              ...          ...                      ...           ...      ...        ...             ...                    ...     ...     ...            ...            ...         ...     ...
          Dask Name: from-delayed, 4 tasks

       The Dask containers will be partitioned in the same way as the Intake  data  source,  allowing  different
       chunks to be processed in parallel. Please read the Dask documentation to understand the differences when
       working with Dask collections (Bag, Array or Data-frames).

   Opening a Catalog
       A Catalog is a collection of data sources, with the type and arguments prescribed for each, and arbitrary
       metadata about each source.  In the simplest case, a catalog can be described by a file in YAML format, a
       "Catalog  file".  In  real usage, catalogues can be defined in a number of ways, such as remote files, by
       connecting to a third-party data service (e.g., SQL server) or through an Intake Server  protocol,  which
       can implement any number of ways to search and deliver data sources.

       The intake example command, above, created a catalog file with the following YAML-syntax content:

          sources:
             states:
              description: US state information from [CivilServices](https://civil.services/)
              driver: csv
              args:
                urlpath: '{{ CATALOG_DIR }}/states_*.csv'
              metadata:
                origin_url: 'https://github.com/CivilServiceUSA/us-states/blob/v1.0.0/data/states.csv'

       To load a Catalog from a Catalog file:

          >>> cat = intake.open_catalog('us_states.yml')
          >>> list(cat)
          ['states']

       This catalog contains one data source, called states.  It can be accessed by attribute:

          >>> cat.states.to_dask()[['state','slug']].head()
                  state        slug
          0     Alabama     alabama
          1      Alaska      alaska
          2     Arizona     arizona
          3    Arkansas    arkansas
          4  California  california

       Placing  data  source  specifications  into  a  catalog like this enables declaring data sets in a single
       canonical place, and not having to use boilerplate code in each notebook/script that  makes  use  of  the
       data. The catalogs can also reference one-another, be stored remotely, and include extra metadata such as
       a  set  of named quick-look plots that are appropriate for the particular data source. Note that catalogs
       are not restricted to being stored in YAML files, that just happens to be the  simplest  way  to  display
       them.

       Many catalog entries will also contain "user_parameter" blocks, which are indications of options
       explicitly allowed by the catalog author, or for validation of the values passed. The user can customise
       how  a  data  source  is  accessed  by providing values for the user_parameters, overriding the arguments
       specified in the entry, or passing extra keyword arguments to be passed to the driver. The keywords  that
       should  be  passed  are  limited  to  the user_parameters defined and the inputs expected by the specific
       driver - such usage is expected only from those already familiar with the specifics of the given  format.
       In  the  following  example,  the  user  overrides  the  "csv_kwargs"  keyword, which is described in the
       documentation for CSVSource and gets passed down to the CSV reader:

          # pass extra kwargs understood by the csv driver
          >>> intake.cat.states(csv_kwargs={'header': None, 'skiprows': 1}).read().head()
                     0           1   ...                                17
          0     Alabama     alabama  ...    https://twitter.com/alabamagov
          1      Alaska      alaska  ...        https://twitter.com/alaska

       Note that, if you are creating such catalogs, you may well start by trying the open_csv  command,  above,
       and  then  use  print(ds.yaml()). If you do this now, you will see that the output is very similar to the
       catalog file we have provided.
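
       A minimal sketch of what this looks like (output abbreviated; the exact fields and their order may
       differ slightly between Intake versions):

          >>> print(ds.yaml())
          sources:
            csv:
              args:
                urlpath: states_*.csv
              description: ''
              driver: csv
              metadata: {}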

   Installing Data Source Packages
       Intake makes it possible to create Data packages (pip or conda) that install data sources into  a  global
       catalog.  For example, we can install a data package containing the same data we have been working with:

          conda install -c intake data-us-states

       Conda  installs  the catalog file in this package to $CONDA_PREFIX/share/intake/us_states.yml.  Now, when
       we import intake, we will see the data from this package appear  as  part  of  a  global  catalog  called
       intake.cat.  In  this  particular case we use Dask to do the reading (which can handle larger-than-memory
       data and parallel processing), but read() would work also:

          >>> import intake
          >>> intake.cat.states.to_dask()[['state','slug']].head()
                  state        slug
          0     Alabama     alabama
          1      Alaska      alaska
          2     Arizona     arizona
          3    Arkansas    arkansas
          4  California  california

       The global catalog is a union of all catalogs installed in the conda/virtualenv environment and also  any
       catalogs installed in user-specific locations.

   Adding Data Source Packages using the Intake path
       Intake  checks  the  Intake  config file for catalog_path or the environment variable "INTAKE_PATH" for a
       colon separated list of paths (semicolon on Windows) to search for catalog files.  When you import
       intake, you will see all entries from all of the catalogs referenced as part of a global catalog called
       intake.cat.
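
       As a sketch (the paths here are hypothetical), you might point Intake at extra catalog directories
       before starting Python:

          export INTAKE_PATH=/shared/catalogs:/home/me/catalogs

       and then every entry from catalog files found there appears under intake.cat:

          >>> import intake
          >>> list(intake.cat)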

   Using the GUI
       A  graphical  data browser is available in the Jupyter notebook environment or standalone web-server.  It
       will show the contents of any installed catalogs, plus allows for selecting local and remote catalogs, to
       browse and select entries from these. See gui.

   Use Cases - I want to...
       Here follows a list of specific things that people may want to get done, and details of  how  Intake  can
       help.  The  details  of  how to achieve each of these activities can be found in the rest of the detailed
       documentation.

   Avoid copy&paste of blocks of code for accessing data
       This is a very common pattern: if you want to load some specific data, you find someone, perhaps a
       colleague, who has accessed it before, and copy their code. Such a practice is extremely error prone,
       and causes a proliferation of copies of code, which may evolve over time, with various versions
       simultaneously in use.

       Intake separates the concerns of data-source specification from code. The specs  are  stored  separately,
       and  all  users  can  reference  the  one  and only authoritative definition, whether in a shared file, a
       service visible to everyone or by using the Intake server. This spec can be updated so that everyone gets
       the current version instead of relying on outdated code.

   Version control data sources
       Version control (e.g., using git) is an essential  practice  in  modern  software  engineering  and  data
       science.  It ensures that the change history is recorded, with times, descriptions and authors along with
       the changes themselves.

       When data is specified using a well-structured syntax such as YAML, it can  be  checked  into  a  version
       controlled  repository  in the usual fashion. Thus, you can bring rigorous practices to your data as well
       as your code.

       If using conda packages to distribute data specifications, these come with  a  natural  internal  version
       numbering system, such that users need only do conda update ... to get the latest version.

   Install data
       Often,  finding  and  grabbing data is a major hurdle to productivity. People may be required to download
       artifacts from various places or search through storage systems to find the specific thing that they  are
       after.  One-line  commands which can retrieve data-source specifications or the files themselves can be a
       massive time-saver. Furthermore, each data-set will typically need its own code to be able to access  it,
       and probably additional software dependencies.

       Intake  allows you to build conda packages, which can include catalog files referencing online resources,
       or to include data files directly in that package.  Whether uploaded  to  anaconda.org  or  hosted  on  a
       private  enterprise  channel,  getting the data becomes a single conda install ... command, whereafter it
       will appear as an entry in intake.cat. The conda package brings versioning and dependency declaration for
       free, and you can include any code that may be required  for  that  specific  data-set  directly  in  the
       package too.

   Update data specifications in-place
       Individual data-sets may often be static, but commonly, the "best" data to get a job done changes with
       time as new facts emerge. Conversely, the very same data might be better stored  in  a  different  format
       which  is,  for  instance,  better-suited to parallel access in the cloud. In such situations, you really
       don't want all the data scientists who rely on it to have their code temporarily broken and be forced
       to change it.

       By  working  with  a  catalog  file/service in a fixed shared location, it is possible to update the data
       source specs in-place. When users now run their code, they will  get  the  latest  version.  Because  all
       Intake  drivers  have the same API, the code using the data will be identical and not need to be changed,
       even when the format has been updated to something more optimised.

   Access data stored on cloud resources
       Services such as AWS  S3,  GCS  and  Azure  Datalake  (or  private  enterprise  variants  of  these)  are
       increasingly popular locations to amass large amounts of data. Not only are they relatively cheap per GB,
       but  they  provide  long-term resilience, metadata services, complex access control patterns and can have
       very large data throughput when accessed in parallel by machines on the same architecture.

       Intake comes with integration to cloud-based storage out-of-the box  for  most  of  the  file-based  data
       formats,  to  be  able  to access the data directly in-place and in parallel. For the few remaining cases
       where direct access is not feasible, the caching system in Intake allows for download of files  on  first
       use, so that all further  access is much faster.

   Work with Big Data
       The  era  of Big Data is here! The term means different things to different people, but certainly implies
       that an individual data-set is too large to fit  into  the  memory  of  a  typical  workstation  computer
       (>>10GB). Nevertheless, most data-loading examples available use functions in packages such as pandas and
       expect  to be able to produce in-memory representations of the whole data. This is clearly a problem, and
       a more general answer should be available aside from "get more memory in your machine".

       Intake integrates with Dask and Spark, which both offer out-of-core  computation  (loading  the  data  in
       chunks  which  fit in memory and aggregating result) or can spread their work over a cluster of machines,
       effectively making use of the shared memory resources of the whole cluster.  Dask integration is built
       into the majority of the drivers and exposed with the .to_dask() method, and Spark integration is
       available for a small number of drivers with a similar .to_spark() method, as well as directly  with  the
       intake-spark package.

       Intake  also  integrates with many data services which themselves can perform big-data computations, only
       extracting the smaller aggregated data-sets that do fit into memory for further analysis.  Services  such
       as  SQL systems, solr, elastic-search, splunk, accumulo and hbase all can distribute the work required to
       fulfill a query across many nodes of a cluster.

   Find the right data-set
       Browsing for the data-set which will solve a particular problem can be hard, even when the data have been
       curated and stored in a single, well-structured system. You do not  want  to  rely  on  word-of-mouth  to
       specify which data is right for which job.

       Intake  catalogs allow for self-description of data-sets, with simple text and arbitrary metadata, with a
       consistent access pattern. Not only can you list the data available to you, but you  can  find  out  what
       exactly that data represents, and the form the data would take if loaded (table versus list of items, for
       example).  This extra metadata is also searchable: you can descend through a hierarchy of catalogs with a
       single search, and find all the entries containing some particular keywords.

       You can use the Intake GUI to graphically browse through your available data-sets or  point  to  catalogs
       available  to  you,  look through the entries listed there and get information about each, or even show a
       sample of the data or quick-look plots. The GUI is also able to execute searches and browse  file-systems
       to  find  data  artifacts  of  interest.  This  same  functionality  is also available via a command-line
       interface or programmatically.

   Work remotely
       Interacting with cloud storage resources is very convenient, but you will  not  want  to  download  large
       amounts  of  data  to  your  laptop  or  workstation  for  analysis.  Intake  finds itself at home in the
       remote-execution world of  jupyter  and  Anaconda  Enterprise  and  other  in-browser  technologies.  For
       instance,  you  can run the Intake GUI either as a stand-alone application for browsing data-sets or in a
       notebook for full analytics, and have all the runtime live on a remote  machine,  or  perhaps  a  cluster
       which  is  co-located  with the data storage. Together with cloud-optimised data formats such as parquet,
       this is an ideal set-up for processing data at web scale.

   Transform data to efficient formats for sharing
       A massive amount of data exists in human-readable formats such as JSON, XML and CSV, which are  not  very
       efficient  in  terms  of  space  usage  and need to be parsed on load to turn into arrays or tables. Much
       faster processing times can be had with modern compact, optimised formats, such as parquet.

       Intake has a "persist" mechanism to transform any input data-source into the format most appropriate  for
       that  type  of  data,  e.g.,  parquet  for tabular data. The persisted data will be used in preference at
       analysis time, and the schedule for updating from the original source is configurable.  The  location  of
       these  persisted data-sets can be shared with others, so they can also gain the benefits, or the "export"
       variant can be used to produce an independent version in  the  same  format,  together  with  a  spec  to
       reference it by; you would then share this spec with others.

   Access data without leaking credentials
       Security  is  important. Users' identity and authority to view specific data should be established before
       handing over any sensitive bytes. It is, unfortunately, all too common for  data  scientists  to  include
       their  username,  passwords or other credentials directly in code, so that it can run automatically, thus
       presenting a potential security gap.

       Intake does not manage credentials or user identities directly,  but  does  provide  hooks  for  fetching
       details  from the environment or other service, and using the values in templating at the time of reading
       the data. Thus, the details are not included in the code, but every access still requires them to be
       present.

       In other cases, you may want to require the user to provide their credentials every time, rather than
       automatically establishing them, and "user parameters" can be specified in Intake to cover this case.
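
       As a minimal sketch (the entry name, parameter and URL are hypothetical), a catalog entry can declare a
       user parameter for the credential and template it into the arguments, so that the value is supplied at
       access time rather than stored in code:

          sources:
            secure_data:
              driver: csv
              parameters:
                token:
                  description: access token, to be supplied by the user
                  type: str
              args:
                urlpath: 'https://example.com/data.csv?token={{ token }}'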

   Establish a data gateway
       The Intake server protocol allows you fine-grained control over the set of data sources that are  listed,
       and  exactly  what  to  return  to  a  user  when  they  want to read some of that data. This is an ideal
       opportunity to include authorisation checks, audit logging, and any more complicated access patterns,  as
       required.

       By streaming the data through a single channel on the server, rather than allowing users direct access to
       the data storage backend, you can log and verify all access to your data.

   Clear distinction between data curator and analyst roles
       It  is  desirable  to separate out two tasks: the definition of data-source specifications, and accessing
       and using data. This is so that those who understand the origins of the data and the implications of
       various formats and other storage options (such as chunk-size) can make those decisions and encode what
       they have done into specs. It leaves the data users, e.g., data scientists, free to find and use the
       data-sets appropriate for their work and simply get on with their job - without  having  to  learn  about
       various storage formats and access APIs.

       This separation is at the very core of what Intake was designed to do.

   Users to be able to access data without learning every backend API
       Data  formats  and  services  are  a  wide mess of many libraries and APIs. A large amount of time can be
       wasted in the life of a data scientist or engineer in finding out the details of  the  ones  required  by
       their  work.  Intake  wraps  these  various  libraries,  REST  APIs  and similar, to provide a consistent
       experience for the data user. source.read() will simply get all of the data into memory in the  container
       type for that source - no further parameters or knowledge required.
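
       For example (illustrative only; the second call assumes the separate intake-parquet driver is
       installed):

          >>> intake.open_csv('data_*.csv').read()        # a dataframe
          >>> intake.open_parquet('data.parquet').read()  # also a dataframe, same call pattern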

       Even  for  the  curator  of  data  catalogs  or  data driver authors, the framework established by Intake
       provides a lot of convenience and simplification which allows each person to deal with only the specifics
       of their job.

   Data sources to be self-describing
       Having a bunch of files in some directory is a very common pattern for data storage in  the  wild.  There
       may  or  may  not  be  a  README  file  co-located  giving some information in a human-readable form, but
       generally not structured - such files are usually different in every case.

       When a data source is encoded into a catalog, the spec offers a natural place to describe what that  data
       is,  along with the possibility to provide an arbitrary amount of structured metadata and to describe any
       parameters that are to be exposed for user choice.  Furthermore, Intake data sources each have a
       particular container type, so that users know whether to expect a dataframe, array, etc., and offer
       simple introspection methods like describe and discover, which return basic information about the data
       without having to load all of it into memory first.

   A data source hierarchy for natural structuring
       Usually,  the set of data sources held by an organisation have relationships to one another, and would be
       poorly served to be provided as a simple flat list of everything available.  Intake  allows  catalogs  to
       refer to other catalogs. This means that you can group data sources by various facets (type, department,
       time...)  and  establish  hierarchical  data-source  trees  within which to find the particular data most
       likely to be of interest.  Since the catalogs live outside and separate from the data  files  themselves,
       as many hierarchy structures as thought useful could be created.

       For  even  more complicated data source meta-structures, it is possible to store all the details and even
       metadata in some external service (e.g., traditional SQL  tables)  with  which  Intake  can  interact  to
       perform queries and return particular subsets of the available data sources.

   Expose several data collections under a single system
       There are already several catalog-like data services in existence in the world, and an organisation may
       have several of these in-house for various different purposes.  For example, an SQL server may hold
       details of customer lists and transactions, but historical time-series and reference  data  may  be  held
       separately  in  archival  data  formats  like  parquet  on  a file-storage system; while real-time system
       monitoring is done by a totally unrelated system such as Splunk or elastic search.

       Of course, Intake can read from various file formats and data services. However, it  can  also  interpret
       the internal conception of data catalogs that some data services may have. For example, all of the tables
       known  to  the  SQL  server, or all of the pre-defined queries in Splunk can be automatically included as
       catalogs in Intake, and take their place amongst the regular YAML-specified data  sources,  with  exactly
       the same usage for all of them.

       These  data  sources and their hierarchical structure can then be exposed via the graphical data browser,
       for searching, selecting and visualising data-sets.

   Modern visualisations for all data-sets
       Intake is integrated with the comprehensive holoviz suite,  particularly  hvplot,  to  bring  simple  yet
       powerful  data  visualisations  to any Intake data source by using just one single method for everything.
       These plots are interactive, and can include server-side dynamic aggregation of very large  data-sets  to
       display more data points than the browser can handle.

       You  can  specify  specific  plot  types  right  in  the data source definition, to have these customised
       visualisations available to the user as simple one-liners known to reveal the content  of  the  data,  or
       even  view  the  same  visuals  right  in  the graphical data source browser application. Thus, Intake is
       already an all-in-one data investigation and dashboarding app.

   Update data specifications in real time
       Intake data catalogs are not limited to reading static specifications from files. They can also execute
       queries on remote data services and return lists of data sources dynamically at runtime. New data sources
       may  appear, for example, as directories of data files are pushed to a storage service, or new tables are
       created within a SQL server.

   Distribute data in a custom format
       Sometimes, the well-known data formats are just not right for a given data-set, and a custom-built format
       is required. In such cases, the code to read the data may not exist in any  library.  Intake  allows  for
       code  to  be  distributed  along with data source specs/catalogs or even files in a single conda package.
       That encapsulates everything needed to describe and use that particular data, and can then be distributed
       as a single entity, and installed with a one-liner.

       Furthermore, should the few builtin container types (sequence, array, dataframe) not be  sufficient,  you
       can  supply  your  own,  and then build drivers that use it.  This was done, for example, for xarray-type
       data, where multiple related N-D arrays  share  a  coordinate  system  and  metadata.  By  creating  this
       container,  a  whole  world  of  scientific  and  engineering  data was opened up to Intake. Creating new
       containers is not hard, though,  and  we  foresee  more  coming,  such  as  machine-learning  models  and
       streaming/real-time data.

   Create Intake data-sets from scratch
       If  you  have  a  set  of files or a data service which you wish to make into a data-set, so that you can
       include it in a catalog, you should use the set of functions intake.open_*, where you need  to  pick  the
       function appropriate for your particular data. You can use tab-completion to list the set of data drivers
       you  have  installed,  and find others you may not yet have installed at plugin-directory.  Once you have
       determined the right set of parameters to load the data in the manner you wish, you can use the  source's
       .yaml()  method  to  find  the  spec that describes the source, so you can insert it into a catalog (with
       appropriate description and metadata). Alternatively, you  can  open  a  YAML  file  as  a  catalog  with
       intake.open_catalog and use its .add() method to insert the source into the corresponding file.
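
       A minimal sketch (the file pattern and catalog path are hypothetical, and the exact .add() signature may
       vary by version):

          >>> ds = intake.open_csv('mydata_*.csv')
          >>> print(ds.yaml())              # spec text to paste into a catalog, after adding a description
          >>> cat = intake.open_catalog('my_catalog.yml')
          >>> cat.add(ds)                   # or insert the source into the catalog file directly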

       If,  instead,  you  have  data in your session in one of the containers supported by Intake (e.g., array,
       data-frame), you can use the intake.upload() function to save it to files in an appropriate format and  a
       location you specify, and give you back a data-source instance, which, again, you can use with .yaml() or
       .add(), as above.
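
       For example (a sketch; the output location is hypothetical):

          >>> source = intake.upload(df, 'output/mydata/')   # writes the data and returns a data source
          >>> print(source.yaml())                           # spec ready to add to a catalog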

   Overview
   Introduction
       This page describes the technical design of Intake, with brief details of the aims of the project and
       the components of the library.

   Why Intake?
       Intake solves a related set of problems:

       • Python API standards for loading data (such as DB-API 2.0) are optimized  for  transactional  databases
         and query results that are processed one row at a time.

       • Libraries  that  do load data in bulk tend to each have their own API for doing so, which adds friction
         when switching data formats.

       • Loading data into a distributed data structure (like those found in  Dask  and  Spark)  often  requires
         writing a separate loader.

       • Abstractions  often  focus  on  just one data model (tabular, n-dimensional array, or semi-structured),
         when many projects need to work with multiple kinds of data.

       Intake has the explicit goal of not defining a computational expression system.  Intake plugins load  the
       data  into  containers  (e.g.,  arrays or data-frames) that provide their data processing features.  As a
       result, it is very easy to make a new Intake plugin with a relatively small amount of Python.

   Structure
       Intake is a Python library for accessing data in a simple and uniform way.  It consists of three parts:

       1. A lightweight plugin system for adding data loader drivers for new  file  formats  and  servers  (like
       databases, REST endpoints or other cataloging services)

       2.  A  cataloging  system  for  specifying these sources in simple YAML syntax, or with plugins that read
       source specs from some external data service

       3. A server-client architecture that can share data catalog metadata over the network, or even stream the
       data directly to clients if needed

       Intake supports loading data into standard Python containers. The list can be easily  extended,  but  the
       currently supported list is:

       • Pandas Dataframes - tabular data

       • NumPy Arrays - tensor data

       • Python lists of dictionaries - semi-structured data

       Additionally,  Intake  can  load  data  into  distributed data structures.  Currently it supports Dask, a
       flexible parallel computing library with distributed  containers  like  dask.dataframe,  dask.array,  and
       dask.bag.   In  the  future,  other distributed computing systems could use Intake to create similar data
       structures.

   Concepts
       Intake is built out of four core concepts:

       • Data Source classes: the "driver" plugins that each implement loading of some  specific  type  of  data
         into python, with plugin-specific arguments.

       • Data  Source:  An  object  that  represents  a  reference to a data source.  Data source instances have
         methods for loading the data into standard containers, like Pandas DataFrames, but do not load any data
         until specifically requested.

       • Catalog: A collection of catalog entries, each of which defines a Data Source. Catalog objects  can  be
         created  from local YAML definitions, by connecting to remote servers, or by some driver that knows how
         to query an external data service.

       • Catalog Entry: A named data source held internally by  catalog  objects,  which  generate  data  source
         instances  when accessed.  The catalog entry includes metadata about the source, as well as the name of
         the driver and arguments. Arguments can be  parameterized,  allowing  one  entry  to  return  different
         subsets of data depending on the user request.

       The  business  of  a  plugin  is to go from some data format (bunch of files or some remote service) to a
       "Container" of the data (e.g., data-frame), a thing on which you can perform further  analysis.   Drivers
       can be used directly by the user, or indirectly through data catalogs.  Data sources can be pickled, sent
       over  the  network  to  other  hosts, and reopened (assuming the remote system has access to the required
       files or servers).
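
       For example, a data source can be serialised and reopened elsewhere (a sketch; the file pattern is
       hypothetical, and the remote host must be able to see the same files):

          >>> import pickle
          >>> ds = intake.open_csv('shared/data_*.csv')
          >>> blob = pickle.dumps(ds)        # carries the spec, not the data
          >>> ds2 = pickle.loads(blob)
          >>> df = ds2.read()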

       See also the glossary.

   Future Directions
       Ongoing work for enhancements, as well as requests for plugins, etc., can be found at the issue  tracker.
       See the roadmap for general mid- and long-term goals.

   Examples
       Here  we list links to notebooks and other code demonstrating the use of Intake in various scenarios. The
       first section is of general interest to various users, and the sections  that  follow  tend  to  be  more
       specific about particular features and workflows.

       Many of the entries here include a link to Binder, which is a service that lets you execute code live
       in a notebook environment. This is a great way to experience using Intake.  It can sometimes take a
       while for Binder to come up; please have patience.

       See  also  the  examples  repository,  containing  data-sets  which  can  be built and installed as conda
       packages.

   General
       • Basic Data scientist workflow: using Intake [Static] [Executable].

       • Workflow for creating catalogs: a Data Engineer's approach to Intake [Static] [Executable]

   Developer
       Tutorials delving deeper into the Internals of Intake, for those who wish to contribute

       • How you would go about writing a new plugin [Static] [Executable]

   Features
       More specific examples of Intake functionality

       • Caching:

            • New-style data package creation [Static]

            • Using automatically cached data-files [Static] [Executable]

            • Earth science demonstration of cached dataset [Static] [Executable]

       • File-name pattern parsing:

            • Satellite imagery, science workflow [Static] [Executable]

            • How to set up pattern parsing [Static] [Executable]

       • Custom catalogs:

            • A custom intake plugin that adapts DCAT catalogs [Static] [Executable]

    Data
        • Anaconda package data, originally announced in this blog

        • Planet Four Catalog, originally from https://www.planetfour.org/results

       • The official Intake examples

   Blogs
       These are Intake-related articles that may be of interest.

        • Discovering and Exploring Data in a Graphical Interface

        • Taking the Pain out of Data Access

        • Caching Data on First Read Makes Future Analysis Faster

        • Parsing Data from Filenames and Paths

        • Intake for cataloguing Spark

        • Intake released on Conda-Forge

    Talks
        • __init__ podcast interview (May 2019)

        • AnacondaCon (March 2019)

        • PyData DC (November 2018)

        • PyData NYC (October 2018)

        • ESIP tech dive (November 2018)

   News
        • See our Wiki page

   Deployment Scenarios
       In the following sections, we will describe some of the ways in which Intake is used in  real  production
       systems.  These  go well beyond the typical YAML files presented in the quickstart and examples sections,
       which are necessarily short and simple, and do not demonstrate the full power of Intake.

   Sharing YAML files
       This is the simplest scenario,  and  amply  described  in  these  documents.  The  primary  advantage  is
       simplicity: it is enough to put a file in an accessible place (even a gist or repo), in order for someone
       else to be able to discover and load that data. Furthermore, such files can easily refer to one another,
       to build up a full tree of data assets with minimum pain. Since YAML files are text, this also lends
       itself to working well with version control systems.  In addition, all sources can describe themselves
       as YAML, and the export and upload commands can produce an efficient format (possibly remote) together
       with a YAML definition in a single step.

   Pangeo
       The Pangeo collaboration uses Intake to catalog their data holdings, which are generally in various forms
       of netCDF-compliant formats, massive multi-dimensional arrays with data relating  to  earth  and  climate
       science  and  meteorology.  On their cloud-based platform, containers start up jupyter-lab sessions which
       have Intake installed, and therefore can simply pick and load the data that each researcher needs - often
       requiring large Dask clusters to actually do the processing.

       A static rendering of the catalog contents is available, so that users can browse  the  holdings  without
       even starting a python session. This rendering is produced by CI on the repo whenever new definitions are
       added, and it also checks (using Intake) that each definition is indeed loadable.

       Pangeo also developed intake-stac, which can talk to STAC servers to make real-time queries and parse the
       results into Intake data sources. This is a standard for spatio-temporal data assets, and indexes massive
       amounts of cloud-stored data.

   Anaconda Enterprise
       Intake will be the basis of the data access and cataloging service within Anaconda Enterprise, running as
       a  micro-service  in  a container, and offering data source definitions to users. The access control, who
       gets to see which data-set, and serving of credentials to be able to read from the various  data  storage
       services, will all be handled by the platform and be fully configurable by admins.

   National Center for Atmospheric Research
       NCAR  has developed intake-esm, a mechanism for creating file-based Intake catalogs for climate data from
       project efforts such as the Coupled Model Intercomparison Project (CMIP) and the Community  Earth  System
       Model (CESM) Large Ensemble Project.  These projects produce a huge amount of climate data, persisted on
       tape and disk storage components across a very large number (of the order of ~300,000) of netCDF files.
       Finding, investigating, and loading these files into data array containers such as xarray can be a
       daunting task due to the large number of files a user may be interested in.  Intake-esm addresses this
       issue in three steps:

       • Dataset collection curation in the form of YAML files. These YAML files provide information about data
         locations, access patterns, directory structure, etc. intake-esm uses these YAML files in conjunction
         with  file name templates to construct a local database. Each row in this database consists of a set of
         metadata such as experiment, modeling realm, frequency corresponding to data contained  in  one  netCDF
         file.

          col = intake.open_esm_metadatastore(collection_input_definition="GLADE-CMIP5")

       • Search and Discovery: once the database is built, intake-esm can be used to search for and discover
         climate datasets, eliminating the need for the user to know the specific locations (file paths) of
         their data set of interest:

          cat = col.search(variable=['hfls'], frequency='mon', modeling_realm='atmos', institute=['CCCma', 'CNRM-CERFACS'])

       • Access: when the user is satisfied with the results of their query, they can ask intake-esm to load the
         actual netCDF files into xarray datasets:

          dsets = cat.to_xarray(decode_times=True, chunks={'time': 50})

   Brookhaven Archive
       The  Bluesky  project  uses  Intake  to  dynamically query a MongoDB instance, which holds the details of
       experimental and simulation data collections, to return a custom Catalog for every query.  Data-sets  can
       then be loaded into python, or the original raw data can be accessed ...

   Zillow
       Zillow  is  developing  Intake to meet the needs of their datalake access layer (DAL), to encapsulate the
       highly hierarchical nature of their data. Of particular importance is the ability to provide different
       versions (testing/production, and different storage formats) of the same logical dataset, depending on
       whether it is being read on a laptop versus the production infrastructure ...

   Intake Server
       The server protocol (see server) is simple enough that anyone can write  their  own  implementation  with
       full  customisation  and  behaviour.  In  particular,  auth  and  monitoring  would  be  essential  for a
       production-grade deployment.

USER GUIDE

       More detailed information about specific parts of Intake, such as how to author catalogs, how to use  the
       graphical interface, plotting, etc.

   GUI
   Using the GUI
       Note: the GUI requires panel and bokeh to be available in the current environment.

       The  Intake  top-level  singleton  intake.gui gives access to a graphical data browser within the Jupyter
       notebook. To expose it, simply enter it into a code cell (Jupyter automatically displays the last object
       in a code cell).  [image]

       New  instances  of  the  GUI  are also available by instantiating intake.interface.gui.GUI, where you can
       specify a list of catalogs to initially include.

       The GUI contains three main areas:

       • a list of catalogs. The "builtin" catalog, displayed by default, includes data-sets  installed  in  the
         system, the same as intake.cat.

       • a list of sources within the currently selected catalog.

       • a description of the currently selected source.

   Catalogs
       Selecting  a  catalog  from  the  list  will  display nested catalogs below the parent and display source
       entries from the catalog in the list of sources.

       Below the lists of catalogs is a row of buttons that are used for adding, removing  and  searching-within
       catalogs:

       • Add:  opens  a sub-panel for adding catalogs to the interface, by either browsing for a local YAML file
         or by entering a URL for a catalog, which can be a remote file or Intake server

       • Remove: deletes the currently selected catalog from the list

       • Search: opens a sub-panel for finding entries in the currently selected catalog (and its sub-catalogs)

   Add Catalogs
       The Add button (+) exposes a sub-panel with two main ways to add catalogs to the interface: [image]

       This panel has a tab to load files from local; from that you can navigate around the filesystem using the
       arrow or by editing the path directly. Use the home button to get back to the starting place. Select  the
       catalog file you need. Use the "Add Catalog" button to add the catalog to the list above.  [image]

       Another   tab  loads  a  catalog  from  remote.  Any  URL  is  valid  here,  including  cloud  locations,
       "gcs://bucket/...", and intake servers, "intake://server:port". Without a protocol specifier, this can be
       a local path. Again, use the "Add Catalog" button to add the catalog to the list above.  [image]

       Finally, you can add catalogs to the  interface  in  code,  using  the  .add()  method,  which  can  take
       filenames, remote URLs or existing Catalog instances.
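
       For example (the URL is hypothetical):

          >>> intake.gui.add('gcs://mybucket/catalogs/cat.yaml')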

   Remove Catalogs
       The  Remove button (-) deletes the currently selected catalog from the list. It is important to note that
       this action does not have any impact on files, it only affects what shows up in the list.  [image]

   Search
       The sub-panel opened by the Search button (🔍) allows the user to  search  within  the  selected  catalog
       [image]

       From the Search sub-panel, the user enters free-form text. Since some catalogs contain nested
       sub-catalogs, the Depth selector allows the search to be limited to the stated number of nesting  levels.
       This  may  be  necessary, since, in theory, catalogs can contain circular references, and therefore allow
       for infinite recursion.  [image]

       Upon execution of the search, the currently selected catalog will be searched. Entries will be considered
       to match if any of the entered words is found in the description of the entry (this is case-insensitive).
       If any matches are found, a new entry will be made in  the  catalog  list,  with  the  suffix  "_search".
       [image]

   Sources
       Selecting a source from the list updates the description text on the left side of the GUI.

       Below the list of sources is a row of buttons for inspecting the selected data source:

       • Plot:  opens  a  sub-panel  for  viewing the pre-defined (specified in the yaml) plots for the selected
         source.

   Plot
       The Plot button (📊) opens a sub-panel with an area for viewing pre-defined plots.  [image]

       These plots are specified in the catalog yaml and that yaml can be displayed by checking the box next  to
       "show yaml".  [image]

       The  holoviews  object  can be retrieved from the gui using intake.interface.source.plot.pane.object, and
       you can then use it in Python or export it to a file.
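
       For example, a sketch of exporting the currently displayed plot (assumes holoviews is importable, which
       it is whenever these plots render at all):

          >>> import holoviews as hv
          >>> obj = intake.interface.source.plot.pane.object
          >>> hv.save(obj, 'plot.html')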

   Interactive Visualization
       If you have installed the optional extra packages dfviz  and  xrviz,  you  can  interactively  plot  your
       dataframe or array data, respectively.  [image]

       The  button  "customize"  will be available for data sources of the appropriate type.  Click this to open
       the interactive interface. If you have not selected a predefined plot  (or  there  are  none),  then  the
       interface will start without any prefilled values, but if you do first select a plot, then the interface
       will have its options pre-filled from the options of the selected plot.

       For specific instructions on how to use the interfaces (which can  also  be  used  independently  of  the
       Intake GUI), please navigate to the linked documentation.

       Note that the final parameters that are sent to hvPlot to produce the output each time a plot is updated
       are explicitly available in YAML format, so that you can save the state as a "predefined plot" in the
       catalog. The same set of parameters can also be used in code, with datasource.plot(...).  [image]

   Using the Selection
       Once catalogs are loaded and the desired sources have been identified and selected, the selected sources
       will  be  available  at the .sources attribute (intake.gui.sources).  Each source entry has informational
       methods available and can be opened as a data source, as with any catalog entry:

          In [ ]: source_entry = intake.gui.sources[0]
                  source_entry
          Out   :
          name: sea_ice_origin
          container: dataframe
          plugin: ['csv']
          description: Arctic/Antarctic Sea Ice
          direct_access: forbid
          user_parameters: []
          metadata:
          args:
            urlpath: https://timeseries.weebly.com/uploads/2/1/0/8/21086414/sea_ice.csv

          In [ ]: data_source = source_entry()  # may specify parameters here
                  data_source.read()
          Out   : < some data >

          In [ ]: source_entry.plot()  # or skip data source step
          Out   : < graphics>

   Catalogs
       Data catalogs provide an abstraction  that  allows  you  to  externally  define,  and  optionally  share,
       descriptions  of  datasets,  called  catalog entries.  A catalog entry for a dataset includes information
       like:

       • The name of the Intake driver that can load the data

       • Arguments to the __init__() method of the driver

       • Metadata provided by the catalog author (such as field descriptions and types, or data provenance)

       In addition, Intake allows the arguments to data sources to be templated, with the  variables  explicitly
       expressed  as  "user parameters". The given arguments are rendered using jinja2, the values of named user
       parameterss, and any overrides.  The parameters are also  offer  validation  of  the  allowed  types  and
       values,  for  both  the template values and the final arguments passed to the data source. The parameters
       are named and described, to indicate to the user what they are for. This kind of structure  can  be  used
       to,  for  example,  choose  between two parts of a given data source, like "latest" and "stable", see the
       entry1_part entry in the example below.

       The user of the catalog can always override any template or argument value at the time that they access
       a given source.
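
       For example, using the entry1_part entry from the YAML example below, a user might select the other
       allowed value of the part parameter at access time (the catalog path is hypothetical):

          >>> cat = intake.open_catalog('catalog.yml')
          >>> cat.entry1_part(part='latest').read()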

   The Catalog class
       In Intake, a Catalog instance is an object with one or more named entries.  The  entries  might  be  read
       from  a  static  file  (e.g.,  YAML,  see the next section), from an Intake server or from any other data
       service that has a driver. Drivers which create catalogs are ordinary  DataSource  classes,  except  that
       they have the container type "catalog", and do not return data products via the read() method.

       For example, you might choose to instantiate the base class and fill in some entries explicitly in your
       code:

          from intake.catalog import Catalog
          from intake.catalog.local import LocalCatalogEntry
          mycat = Catalog.from_dict({
              'source1': LocalCatalogEntry(name, description, driver, args=...),
              ...
              })

       Alternatively, subclasses of Catalog can define how entries are created from  whichever  file  format  or
       service they interact with, examples including RemoteCatalog and SQLCatalog. These generate entries based
       on their respective targets; some provide advanced search capabilities executed on the server.

   YAML Format
       Intake  catalogs  can  most simply be described with YAML files. This is very common in the tutorials and
       this documentation, because it is simple to understand, but demonstrates many of the features of Intake.
       Note
       that  YAML  files  are also the easiest way to share a catalog, simply by copying to a publicly-available
       location such as a cloud storage bucket.  Here is an example:

          metadata:
            version: 1
            parameters:
              file_name:
                type: str
                description: default file name for child entries
                default: example_file_name
          sources:
            example:
              description: test
              driver: random
              args: {}

            entry1_full:
              description: entry1 full
              metadata:
                foo: 'bar'
                bar: [1, 2, 3]
              driver: csv
              args: # passed to the open() method
                urlpath: '{{ CATALOG_DIR }}/entry1_*.csv'

            entry1_part:
              description: entry1 part
              parameters: # User parameters
                part:
                  description: section of the data
                  type: str
                  default: "stable"
                  allowed: ["latest", "stable"]
              driver: csv
              args:
                urlpath: '{{ CATALOG_DIR }}/entry1_{{ part }}.csv'

            entry2:
              description: entry2
              driver: csv
              args:
                # file_name parameter will be inherited from file-level parameters, so will
                # default to "example_file_name"
                 urlpath: '{{ CATALOG_DIR }}/entry2/{{ file_name }}.csv'

   Metadata
       Arbitrary extra descriptive information can go into the metadata section. Some fields will be claimed for
       internal use and some fields may be restricted to local reading; but for  now  the  only  field  that  is
       expected is version, which will be updated when a breaking change is made to the file format. Any catalog
       will have .metadata and .version attributes available.

       Note that each source also has its own metadata.

        The metadata section can also contain parameters which will be inherited by the sources in the file (note
       that these sources can augment these parameters, or override them with their own parameters).

   Extra drivers
       The driver: entry of a data source specification can be a driver name, as has been shown in the  examples
       so far.  It can also be an absolute class path to use for the data source, in which case there will be no
        ambiguity about how to load the data. That is the preferred way to be explicit when the driver name
       alone is not enough (see Driver Selection, below).

          plugins:
            source:
              - module: intake.catalog.tests.example1_source
          sources:
            ...

       However, you do not, in general, need to do this, since  the  driver:  field  of  each  source  can  also
       explicitly refer to the plugin class.

   Sources
       The  majority of a catalog file is composed of data sources, which are named data sets that can be loaded
        for the user.  Catalog authors describe the contents of a data set, how to load it, and optionally offer
       some customization of the returned data.  Each data source has several attributes:

       • name:  The  canonical  name  of the source.  Best practice is to compose source names from valid Python
         identifiers.  This allows Intake to support things like tab completion of data source names on  catalog
         objects.  For example, monthly_downloads is a good source name.

       • description: Human readable description of the source.  To help catalog browsing tools, the description
         should be Markdown.

       • driver:  Name  of  the  Intake Driver to use with this source.  Must either already be installed in the
         current Python environment (i.e. with conda or pip) or loaded in the plugin section of the file. Can be
         a  simple  driver  name  like  "csv"  or   the   full   path   to   the   implementation   class   like
         "package.module.Class".

       • args: Keyword arguments to the init method of the driver.  Arguments may use template expansion.

       • metadata:  Any  metadata  keys  that  should be attached to the data source when opened.  These will be
         supplemented by additional metadata provided by the driver.  Catalog authors can use whatever key names
         they would like, with the exception that keys starting with  a  leading  underscore  are  reserved  for
         future internal use by Intake.

       • direct_access:  Control  whether  the  data  is  directly  accessed by the client, or proxied through a
         catalog server.  See Server Catalogs for more details.

       • parameters: A dictionary of data source parameters.  See below for more details.

   Caching Source Files Locally
        This method of defining the cache with a dedicated block is deprecated; see the Remote Access section,
        below.

       To enable caching on the first read of remote data source files, add the cache section with the following
       attributes:

       • argkey: The args section key which contains the URL(s) of the data to be cached.

       • type:  One  of  the  keys  in  the  cache  registry  [intake.source.cache.registry],  referring  to  an
         implementation of caching behaviour. The default is "file" for the caching of one or more files.

       Example:

          test_cache:
            description: cache a csv file from the local filesystem
            driver: csv
            cache:
              - argkey: urlpath
                type: file
            args:
              urlpath: '{{ CATALOG_DIR }}/cache_data/states.csv'

       The cache_dir defaults to ~/.intake/cache, and can be specified  in  the  intake  configuration  file  or
       INTAKE_CACHE_DIR environment variable, or at runtime using the "cache_dir" key of the configuration.  The
       special value "catdir" implies that cached files will appear in the same directory as the catalog file in
       which  the  data source is defined, within a directory named "intake_cache". These will not appear in the
       cache usage reported by the CLI.
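
        For example, a minimal sketch of changing the cache location at runtime, using the special "catdir"
        value described above:

           from intake.config import conf

           # place cached files in an "intake_cache" directory next to the catalog file
           conf['cache_dir'] = 'catdir'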

        Optionally, the cache section can have a regex attribute that modifies the path of the cache on the
        disk. By default, the cache path is made by concatenating cache_dir, dataset name, hash of the url, and
        the url itself (without the protocol). The regex attribute allows one to remove the matching part of the
        url.

       Caching can be disabled at runtime for all sources regardless of the catalog specification:

          from intake.config import conf

          conf['cache_disabled'] = True

        By default, progress bars are shown during downloads if the package tqdm is available, but this can be
        disabled (e.g., for consoles that don't support complex text) with:

           conf['cache_download_progress'] = False

        or, equivalently, with the environment variable INTAKE_CACHE_PROGRESS.

       The  "types" of caching are that supported are listed in intake.source.cache.registry, see the docstrings
       of each for specific parameters that should appear in the cache block.

       It is possible to  work  with  compressed  source  files  by  setting  type:  compression  in  the  cache
       specification.   By default the compression type is inferred from the file extension, otherwise it can be
       set by assigning the decomp variable to any of the  options  listed  in  intake.source.decompress.decomp.
       This  will  extract  all  the  file(s) in the compressed file referenced by urlpath and store them in the
       cache directory.

       In cases where miscellaneous files are present in the compressed file, a regex_filter  parameter  can  be
       used.  Only  the extracted filenames that match the pattern will be loaded. The cache path is appended to
        the filename, so it is necessary to include a wildcard at the beginning of the pattern.

       Example:

          test_compressed:
            driver: csv
            args:
              urlpath: 'compressed_file.tar.gz'
            cache:
              - type: compressed
                decomp: tgz
                argkey: urlpath
                regex_filter: '.*data.csv'

   Templating
       Intake catalog files support Jinja2 templating for driver arguments. Any occurrence of a  substring  like
       {{field}}  will  be  replaced  by  the  value  of  the  user parameters with that same name, or the value
       explicitly provided by the user. For how to specify these user parameters, see the next section.

       Some additional values are available for templating. The following is always available: CATALOG_DIR,  the
       full  path to the directory containing the YAML catalog file.  This is especially useful for constructing
       paths relative to the catalog directory to locate data files and custom drivers.  For example, the search
       for CSV files for the two "entry1" blocks, above, will happen in the same directory as where the  catalog
       file was found.

       The  following  functions  may  be  available. Since these execute code, the user of a catalog may decide
       whether they trust those functions or not.

       • env("USER"): look in the set environment variables for the named variable

       • client_env("USER"): exactly the same, except that when using a client-server topology, the  value  will
         come from the environment of the client.

       • shell("get_login  thisuser  -t"): execute the command, and use the output as the value. The output will
         be trimmed of any trailing whitespace.

       • client_shell("get_login thisuser -t"): exactly  the  same,  except  that  when  using  a  client-server
         topology, the value will come from the system of the client.

       The  reason  for  the  "client"  versions of the functions is to prevent leakage of potentially sensitive
       information between client and server by controlling where lookups happen. When working without a server,
       only the ones without "client" are used.

       An example:

          sources:
            personal_source:
              description: This source needs your username
              args:
                url: "http://server:port/user/{{env(USER)}}"

        Here, if the user is named "blogs", the url argument will resolve to "http://server:port/user/blogs"; if
        the environment variable is not defined, it will resolve to "http://server:port/user/".

   Parameter Definition
   Source parameters
       A  source  definition  can  contain  a  "parameters"  block.   Expressed in YAML, a parameter may look as
       follows:

          parameters:
            name:
              description: name to use  # human-readable text for what this parameter means
              type: str  # one of bool, str, int, float, list[str | int | float], datetime, mlist
              default: normal  # optional, value to assume if user does not override
              allowed: ["normal", "strange"]  # optional, list of values that are OK, for validation
              min: "n"  # optional, minimum allowed, for validation
              max: "t"  # optional, maximum allowed, for validation

       A parameter, not to be confused with an argument, can have one of two uses:

       • to provide values for variables to be used in templating  the  arguments.  If  the  pattern  "{{name}}"
         exists  in  any of the source arguments, it will be replaced by the value of the parameter. If the user
          provides a value (e.g., source = cat.entry(name='something')), that will be used, otherwise the default
         value. If there is no user input or default, the empty value appropriate for type is used. The  default
         field allows for the same function expansion as listed for arguments, above.

       • If  an  argument  with  the same name as the parameter exists, its value, after any templating, will be
         coerced to the given type of the parameter and validated against the allowed/max/min. It  is  therefore
         possible  to use the string templating system (e.g., to get a value from the environment), but pass the
         final value as, for example, an integer. It makes no sense to provide a  default  for  this  case  (the
         argument already has a value), but providing a default will not raise an exception.

       • the  "mlist"  type is special: it means that the input must be a list, whose values are chosen from the
         allowed list. This is the only type where the parameter value is not  the  same  type  as  the  allowed
         list's values, e.g., if a list of str is set for allowed, a list of str must also be the final value.

       Note:  the  datetime  type accepts multiple values: Python datetime, ISO8601 string,  Unix timestamp int,
       "now" and  "today".

   Catalog parameters
       You can also define user parameters at the catalog level. This  applies  the  parameter  to  all  entries
        within that catalog, without having to define it for each and every entry.  Furthermore, catalogs nested
       within the catalog will also inherit the parameter(s).

       For example, with the following spec

          metadata:
            version: 1
            parameters:
              bucket:
                type: str
                description: description
                default: test_bucket
          sources:
            param_source:
              driver: parquet
              description: description
              args:
                urlpath: s3://{{bucket}}/file.parquet
            subcat:
              driver: yaml_file
              path: "{{CATALOG_DIR}}/other.yaml"

        If cat is the corresponding catalog instance, the URL of source cat.param_source will evaluate to
       "s3://test_bucket/file.parquet"    by    default,    but   the   parameter   can   be   overridden   with
       cat.param_source(bucket="other_bucket"). Also, any entries of subcat,  another  catalog  referenced  from
       here,  would  also have the "bucket"-named parameter attached to all sources. Of course, those sources do
        not need to make use of the parameter.

        To change the default, we can generate a new instance:

          cat2 = cat(bucket="production")  # sets default value of "bucket" for cat2
          subcat = cat.subcat(bucket="production")  # sets default only for the nested catalog

       Of course, in these situations you can still override the value of the parameter for any source, or  pass
       explicit values for the arguments of the source, as normal.

       For  cases  where  the  catalog  is  not  defined  in  a  YAML  spec, the argument user_parameters to the
       constructor takes the same form as parameters above: a dict of user parameters, either  as  UserParameter
       instances or as a dictionary spec for each one.
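
        For example, a rough sketch using the dictionary-spec form (the parameter name and values are
        illustrative only):

           from intake.catalog import Catalog

           # an otherwise empty catalog carrying a single catalog-level user parameter
           cat = Catalog(user_parameters={'bucket': {'type': 'str',
                                                     'description': 'bucket to read from',
                                                     'default': 'test_bucket'}})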

   Driver Selection
       In some cases, it may be possible that multiple backends are capable of loading from the same data format
       or  service.  Sometimes, this may mean two drivers with unique names, or a single driver with a parameter
       to choose between the different backends.

       However, it is possible that multiple drivers for reading a particular type of data also share  the  same
       driver  name:  for  example, both the intake-iris and the intake-xarray packages contain drivers with the
       name "netcdf", which are capable of reading the same files, but with different  backends.  Here  we  will
       describe the various possibilities of coping with this situation. Intake's plugin system makes it easy to
       encode such choices.

       It  may  be  acceptable to use any driver which claims to handle that data type, or to give the option of
       which driver to use to the user,  or  it  may  be  necessary  to  specify  which  precise  driver(s)  are
       appropriate  for  that  particular  data.  Intake  allows all of these possibilities, even if the backend
       drivers require extra arguments.

       Specifying a single driver explicitly, rather than using a generic name, would look like this:

          sources:
            example:
              description: test
              driver: package.module.PluginClass
              args: {}

       It is also possible to describe a list of drivers with the same syntax. The first one found will  be  the
       one used. Note that the class imports will only happen at data source instantiation, i.e., when the entry
       is selected from the catalog.

          sources:
            example:
              description: test
              driver:
                - package.module.PluginClass
                - another_package.PluginClass2
              args: {}

       These  alternative  plugins  can also be given data-source specific names, allowing the user to choose at
       load time with driver= as a parameter. Additional arguments may also be required for each option  (which,
       as  usual,  may include user parameters); however, the same global arguments will be passed to all of the
       drivers listed.

          sources:
            example:
              description: test
              driver:
                first:
                  class: package.module.PluginClass
                  args:
                    specific_thing: 9
                second:
                  class: another_package.PluginClass2
              args: {}
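
        With a definition like the one above, a brief sketch of the user-side choice:

           src = cat.example(driver="first")    # use package.module.PluginClass, with specific_thing=9
           src = cat.example(driver="second")   # use another_package.PluginClass2 instead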

   Remote Access
       (see also remote_data for the implementation details)

       Many drivers support reading directly from remote data sources such as HTTP, S3 or GCS. In  these  cases,
       the  path  to  read  from is usually given with a protocol prefix such as gcs://. Additional dependencies
       will typically be required (requests, s3fs, gcsfs, etc.), any data package should specify these.  Further
       parameters may be necessary for communicating with the storage backend and,  by  convention,  the  driver
       should take a parameter storage_options containing arguments to pass to the backend. Some remote backends
       may also make use of environment variables or config files to determine their default behaviour.

       The  special template variable "CATALOG_DIR" may be used to construct relative URLs in the arguments to a
       source. In such cases, if the filesystem  used  to  load  that  catalog  contained  arguments,  then  the
       storage_options  of  that  file system will be extracted and passed to the source. Therefore, all sources
       which can accept general URLs (beyond just local paths) must make sure to accept this argument.

       As an example of using storage_options, the following two sources would allow for reading CSV  data  from
       S3 and GCS backends without authentication (anonymous access), respectively

          sources:
            s3_csv:
              driver: csv
              description: "Publicly accessible CSV data on S3; requires s3fs"
              args:
                urlpath: s3://bucket/path/*.csv
                storage_options:
                  anon: true
            gcs_csv:
              driver: csv
              description: "Publicly accessible CSV data on GCS; requires gcsfs"
              args:
                urlpath: gcs://bucket/path/*.csv
                storage_options:
                  token: "anon"

   Caching
       URLs  interpreted  by  fsspec  offer automatic caching. For example, to enable file-based caching for the
       first source above, you can do:

          sources:
            s3_csv:
              driver: csv
              description: "Publicly accessible CSV data on S3; requires s3fs"
              args:
                urlpath: simplecache::s3://bucket/path/*.csv
                storage_options:
                  s3:
                    anon: true

       Here we have added the "simplecache" to the URL (this caching backend does not store any  metadata  about
       the  cached  file)  and  specified  that  the  "anon" parameter is meant as an argument to s3, not to the
       caching mechanism. As each file in s3 is accessed, it will first be downloaded and then the local version
       used instead.

       You can tailor how the caching works. In particular the location of the local storage can be set with the
       cache_storage parameter (under the "simplecache" group of storage_options, of course)  -  otherwise  they
       are  stored  in  a  temporary  location  only  for  the duration of the current python session. The cache
       location  is  particularly  useful  in  conjunction  with  an  environment  variable,  or   relative   to
       "{{CATALOG_DIR}}", wherever the catalog was loaded from.

       Please see the fsspec documentation for the full set of cache types and their various options.

   Local Catalogs
       A Catalog can be loaded from a YAML file on the local filesystem by creating a Catalog object:

          from intake import open_catalog
          cat = open_catalog('catalog.yaml')

       Then sources can be listed:

          list(cat)

       and data sources are loaded via their name:

          data = cat.entry_part1

       and  you  can  optionally  configure  new  instances  of the source to define user parameters or override
       arguments by calling either of:

          data = cat.entry_part1.configure_new(part='1')
          data = cat.entry_part1(part='1')  # this is a convenience shorthand

       Intake also supports loading a catalog from all of the files ending in .yml and .yaml in a directory,  or
        by using an explicit glob-string. Note that the URL provided may refer to remote storage systems by
        passing a protocol specifier such as s3:// or gcs://:

          cat = open_catalog('/research/my_project/catalog.d/')
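
        A glob-string or a remote location works in the same way (the bucket name below is hypothetical):

           cat = open_catalog('/research/my_project/catalog.d/*.yaml')
           cat = open_catalog('s3://my-bucket/catalogs/*.yaml')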

       Intake Catalog objects  will  automatically  reload  changes  or  new  additions  to  catalog  files  and
       directories on disk.  These changes will not affect already-opened data sources.

   Catalog Nesting
       A catalog is just another type of data source for Intake. For example, you can print a YAML specification
       corresponding to a catalog as follows:

          cat = intake.open_catalog('cat.yaml')
          print(cat.yaml())

       results in:

          sources:
            cat:
              args:
                path: cat.yaml
              description: ''
              driver: intake.catalog.local.YAMLFileCatalog
              metadata: {}

        The point here is that this can be included in another catalog.  (It would, of course, be better to
       include a description and the full path of the catalog file here.)  If the  entry  above  were  saved  to
       another file, "root.yaml", and the original catalog contained an entry, data, you could access it as:

          root = intake.open_catalog('root.yaml')
          root.cat.data

       It  is,  therefore,  possible  to build up a hierarchy of catalogs referencing each other.  These can, of
       course, include remote URLs and indeed catalog sources other than simple files (all the tables on  a  SQL
       server,  for instance). Plus, since the argument and parameter system also applies to entries such as the
       example above, it would be possible to give the user a  runtime  choice  of  multiple  catalogs  to  pick
       between, or have this decision depend on an environment variable.

   Server Catalogs
       Intake  also  includes  a server which can share an Intake catalog over HTTP (or HTTPS with the help of a
       TLS-enabled reverse proxy).  From the user perspective, remote catalogs  function  identically  to  local
       catalogs:

          cat = open_catalog('intake://catalog1:5000')
          list(cat)

       The  difference  is  that  operations  on  the  catalog translate to requests sent to the catalog server.
       Catalog servers provide access to data sources in one of two modes:

       • Direct access: In this mode, the catalog server tells the client how to load the data, but  the  client
         uses  its  local  drivers  to  make  the  connection.  This requires the client has the required driver
         already installed and has direct access to the files or data servers that the driver will connect to.

       • Proxied access: In this mode, the catalog server uses its local drivers to open  the  data  source  and
         stream  the  data over the network to the client.  The client does not need any special drivers to read
         the data, and can read data from files and data servers that it cannot access, as long as  the  catalog
         server has the required access.

       Whether  a  particular catalog entry supports direct or proxied access is determined by the direct_access
       option:

       • forbid (default): Force all clients to proxy data through the catalog server

       • allow: If the client has the required driver, access the source  directly,  otherwise  proxy  the  data
         through the catalog server.

       • force:  Force  all  clients  to  access the data directly.  If they do not have the required driver, an
         exception will be raised.

       Note that when the client is loading a data source via direct access, the catalog  server  will  need  to
       send  the  driver  arguments  to  the client.  Do not include sensitive credentials in a data source that
       allows direct access.

   Client Authorization Plugins
       Intake servers can check if clients are authorized to access  the  catalog  as  a  whole,  or  individual
       catalog  entries.   Typically  a  matched  pair  of  server-side  plugin  (called an "auth plugin") and a
        client-side plugin (called a "client auth plugin") need to be enabled for authorization checks to work.
       This feature is still in early development, but see module intake.auth.secret for a demonstration pair of
        server and client classes implementing auth via a shared secret. See auth-plugins.

   Command Line Tools
        The package installs two executable commands: one for starting the catalog server, and a client for
        accessing catalogs and manipulating the configuration.

   Configuration
       A file-based configuration service is available to Intake. This file is by default sought at the location
       ~/.intake/conf.yaml,  but  either of the environment variables INTAKE_CONF_DIR or INTAKE_CONF_FILE can be
       used to specify another directory or file. If both are given, the latter takes priority.

       At present, the configuration file might look as follows:

          auth:
            cls: "intake.auth.base.BaseAuth"
          port: 5000
          catalog_path:
            - /home/myusername/special_dir

        These are the defaults, and any parameters not specified will take the values above:

       • the Intake Server will listen on port 5000 (this can be overridden on the command line, see below)

       • and the auth system used will be the fully qualified class given (which, for  BaseAuth,  always  allows
          access). For further information on securing the Intake Server, see auth-plugins.

       See intake.config.defaults for a full list of keys and their default values.
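
        For instance, a quick sketch of inspecting the configuration from Python:

           from intake.config import conf, defaults

           print(defaults['port'])   # 5000, the built-in default
           print(conf['port'])       # the effective value after any file or environment overrides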

   Log Level
       The logging level is configurable using Python's built-in logging module.

       The  config  option  'logging' holds the current level for the intake logger, and can take values such as
       'INFO' or 'DEBUG'. This can be set in the conf.yaml file of the config directory (e.g.,  ~/.intake/),  or
       overridden by the environment variable INTAKE_LOG_LEVEL.

       Furthermore, the level and settings of the logger can be changed programmatically in code:

          import logging
          logger = logging.getLogger('intake')
          logger.setLevel(logging.DEBUG)
           logger.addHandler(...)  # e.g., logging.StreamHandler()

   Intake Server
       The server takes one or more catalog files as input and makes them available on port 5000 by default.

       You can see the full description of the server command with:

          >>> intake-server --help

          usage: intake-server [-h] [-p PORT] [--list-entries] [--sys-exit-on-sigterm]
                               [--flatten] [--no-flatten] [-a ADDRESS]
                               FILE [FILE ...]

          Intake Catalog Server

          positional arguments:
            FILE                  Name of catalog YAML file

          optional arguments:
            -h, --help            show this help message and exit
            -p PORT, --port PORT  port number for server to listen on
            --list-entries        list catalog entries at startup
            --sys-exit-on-sigterm
                                  internal flag used during unit testing to ensure
                                  .coverage file is written
            --flatten
            --no-flatten
            -a ADDRESS, --address ADDRESS
                                  address to use as a host, defaults to the address in
                                  the configuration file, if provided otherwise localhost

       To start the server with a local catalog file, use the following:

          >>> intake-server intake/catalog/tests/catalog1.yml
          Creating catalog from:
            - intake/catalog/tests/catalog1.yml
          catalog_args ['intake/catalog/tests/catalog1.yml']
          Entries: entry1,entry1_part,use_example1
          Listening on port 5000

        You can use the catalog client (described below) against this server as follows:

          $ intake list intake://localhost:5000
          entry1
          entry1_part
          use_example1

   Intake Client
       While  the  Intake data sources will typically be accessed through the Python API, you can use the client
       to verify a catalog file.

       Unlike the server command, the client has several subcommands to access a catalog. You can see  the  list
       of available subcommands with:

          >>> intake --help
          usage: intake {list,describe,exists,get,discover} ...

       We go into further detail in the following sections.

   List
       This  subcommand lists the names of all available catalog entries. This is useful since other subcommands
       require these names.

       If you wish to see the details about each catalog entry, use the --full  flag.   This  is  equivalent  to
       running the intake describe subcommand for all catalog entries.

          >>> intake list --help
          usage: intake list [-h] [--full] URI

          positional arguments:
            URI         Catalog URI

          optional arguments:
            -h, --help  show this help message and exit
            --full

          >>> intake list intake/catalog/tests/catalog1.yml
          entry1
          entry1_part
          use_example1
          >>> intake list --full intake/catalog/tests/catalog1.yml
          [entry1] container=dataframe
          [entry1] description=entry1 full
          [entry1] direct_access=forbid
          [entry1] user_parameters=[]
          [entry1_part] container=dataframe
          [entry1_part] description=entry1 part
          [entry1_part] direct_access=allow
          [entry1_part] user_parameters=[{'default': '1', 'allowed': ['1', '2'], 'type': u'str', 'name': u'part', 'description': u'part of filename'}]
          [use_example1] container=dataframe
          [use_example1] description=example1 source plugin
          [use_example1] direct_access=forbid
          [use_example1] user_parameters=[]

   Describe
       Given the name of a catalog entry, this subcommand lists the details of the respective catalog entry.

          >>> intake describe --help
          usage: intake describe [-h] URI NAME

          positional arguments:
            URI         Catalog URI
            NAME        Catalog name

          optional arguments:
            -h, --help  show this help message and exit

          >>> intake describe intake/catalog/tests/catalog1.yml entry1
          [entry1] container=dataframe
          [entry1] description=entry1 full
          [entry1] direct_access=forbid
          [entry1] user_parameters=[]

   Discover
       Given  the  name  of a catalog entry, this subcommand returns a key-value description of the data source.
       The exact details are subject to change.

          >>> intake discover --help
          usage: intake discover [-h] URI NAME

          positional arguments:
            URI         Catalog URI
            NAME        Catalog name

          optional arguments:
            -h, --help  show this help message and exit

          >>> intake discover intake/catalog/tests/catalog1.yml entry1
          {'npartitions': 2, 'dtype': dtype([('name', 'O'), ('score', '<f8'), ('rank', '<i8')]), 'shape': (None,), 'datashape':None, 'metadata': {'foo': 'bar', 'bar': [1, 2, 3]}}

   Exists
       Given the name of a catalog entry, this subcommand returns whether or not the respective catalog entry is
       valid.

          >>> intake exists --help
          usage: intake exists [-h] URI NAME

          positional arguments:
            URI         Catalog URI
            NAME        Catalog name

          optional arguments:
            -h, --help  show this help message and exit

          >>> intake exists intake/catalog/tests/catalog1.yml entry1
          True
          >>> intake exists intake/catalog/tests/catalog1.yml entry2
          False

   Get
       Given the name of a catalog entry, this subcommand outputs the entire data source to standard output.

          >>> intake get --help
          usage: intake get [-h] URI NAME

          positional arguments:
            URI         Catalog URI
            NAME        Catalog name

          optional arguments:
            -h, --help  show this help message and exit

          >>> intake get intake/catalog/tests/catalog1.yml entry1
                 name  score  rank
          0    Alice1  100.5     1
          1      Bob1   50.3     2
          2  Charlie1   25.0     3
          3      Eve1   25.0     3
          4    Alice2  100.5     1
          5      Bob2   50.3     2
          6  Charlie2   25.0     3
          7      Eve2   25.0     3

   Config and Cache
       CLI functions starting with intake cache and intake config are available to provide information about the
        system: the locations and values of configuration parameters, and the state of cached files.

   Persisting Data
       (this is an experimental new feature, expect enhancements and changes)

   Introduction
       As defined in the glossary, to Persist is to convert data into the storage format  most  appropriate  for
       the  container  type, and save a copy of this for rapid lookup in the future.  This is of great potential
       benefit where the creation or transfer of the original data source takes some time.

       This is not to be confused with the file Cache.

   Usage
       Any Data Source has a method .persist(). The only option that you will need to pick is a TTL, the  number
       of  seconds  that the persisted version lasts before expiry (leave as None for no expiry). This creates a
        local copy in the persist directory, which may be in "~/.intake/persist", but can be configured.

       Each container type (dataframe, array, ...) will have  its  own  implementation  of  persistence,  and  a
       particular  file  storage  format  associated.  The call to .persist() may take arguments to tune how the
       local files are created, and in some cases may require additional optional packages to be installed.

       Example:

          cat = intake.open_catalog('mycat.yaml')  # load a remote cat
          source = cat.csvsource()  # source pointing to remote data
          source.persist()

          source = cat.csvsource()  # future use now gives local intake_parquet.ParquetSource

        To control whether a catalog will automatically give you the persisted version of a source in this way,
        use the argument persist_mode; e.g., to ignore locally persisted versions, you could have done:

           cat = intake.open_catalog('mycat.yaml', persist_mode='never')
           # or
           source = cat.csvsource(persist_mode='never')

       Note  that  if you give a TTL (in seconds), then the original source will be accessed and a new persisted
       version written transparently when the old persisted version has expired.

       Note that after persisting, the original source will  have  source.has_been_persisted  ==  True  and  the
       persisted source (i.e., the one loaded from local files) will have source.is_persisted == True.
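
        Putting these pieces together, a short sketch (assuming the TTL is given as the ttl keyword argument;
        the one-hour value is arbitrary):

           source = cat.csvsource()
           source.persist(ttl=3600)        # re-persist transparently after one hour
           source.has_been_persisted       # True on the original source

           local = cat.csvsource()         # now resolves to the locally persisted version
           local.is_persisted              # True on the persisted source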

   Export
       A  similar  concept  to  Persist,  Export  allows  you  to make a copy of some data source, in the format
       appropriate for its container, and place this data-set in whichever location suits you, including  remote
       locations.  This  functionality (source.export()) does not touch the persist store; instead, it returns a
       YAML text representation of the output, so that you can put it into a catalog of your own.  It  would  be
       this catalog that you share with other people.

       Note  that  "exported" data-sources like this do contain the information of the original source they were
       made from in their metadata, so you can recreate the original source, if  you  want  to,  and  read  from
       there.
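
        As a rough sketch (the target location, and the exact call signature, are assumptions; check the API
        documentation for your version):

           source = cat.csvsource()
           entry_yaml = source.export('s3://my-bucket/exported/')  # write the data to the chosen location
           print(entry_yaml)  # a YAML entry you can paste into a catalog to share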

   Persisting to Remote
        If you are typically running your code inside of ephemeral containers, then persisting data-sets may be
       something that you want to do (because the original source is slow, or parsing is CPU/memory  intensive),
       but  the  local  storage  is not useful. In some cases you may have access to some shared network storage
       mounted on the instance, but in other cases you will want to persist to a remote store.

        The config value 'persist_path', which can also be set by the environment variable INTAKE_PERSIST_PATH,
        can be a remote location such as s3://mybucket/intake-persist. You will need to install the appropriate
       package to talk to the external storage (e.g., s3fs, gcsfs, pyarrow),  but  otherwise  everything  should
       work as before, and you can access the persisted data from any container.

   The Persist Store
       You can interact directly with the class implementing persistence:

          from intake.container.persist import store

        This singleton instance, which acts like a catalog, allows you to query the contents of the persist
        store and to add and remove entries. It also allows you to find the original source for any given
       persisted source, and refresh the persisted version on demand.
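
        Since the store acts like a catalog, a brief sketch of inspecting it:

           from intake.container.persist import store

           list(store)   # the tokens of all currently persisted sources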

       For    details    on    the    methods    of    the   persist   store,   see   the   API   documentation:
       intake.container.persist.PersistStore(). Sources in the store  carry  a  lot  of  information  about  the
       sources  they  were  made  from,  so that they can be remade successfully. This all appears in the source
       metadata.  The sources use the "token" of the original data source as their key in  the  store,  a  value
       which  can  be  found  by  dask.base.tokenize(source)  for  the original source, or can be taken from the
       metadata of a persisted source.

       Note that all of the information about persisted sources is held in a single YAML  file  in  the  persist
       directory     (typically     /persisted/cat.yaml     within     the    config    directory,    but    see
        intake.config.conf['persist_path']). This file can be edited by hand if you want to, for example, set
       some persisted source not to expire. This is only recommended for experts.

   Future Enhancements
       • CLI functionality to investigate and alter the state of the persist store.

       • Time check-pointing of persisted data, such that you can not only get the "most recent" but any version
         in the time-series.

       • (eventually)  pipeline functionality, whereby a persisted data source depends on another persisted data
         source, and the whole train can be refreshed on a schedule or on demand.

   Plotting
       Intake provides a plotting API based on the hvPlot library, which closely mirrors the pandas plotting API
       but generates interactive plots using HoloViews and Bokeh.

       The hvPlot website provides comprehensive documentation on using the plotting API  to  quickly  visualize
       and explore small and large datasets. The main features offered by the plotting API include:

          • Support for tabular data stored in pandas and dask dataframes

          • Support for gridded data stored in xarray backed nD-arrays

          • Support for plotting large datasets with datashader

       Using  Intake  alongside  hvPlot allows declaratively persisting plot declarations and default options in
       the regular catalog.yaml files.

   Setup
       For detailed installation instructions see the getting started section in the hvPlot  documentation.   To
        start with, install hvplot using conda:

          conda install -c conda-forge hvplot

       or using pip:

          pip install hvplot

   Usage
        The plotting API is designed to work well both inside and outside the Jupyter notebook; however, when
        using it in JupyterLab, the PyViz lab extension must be installed first:

          jupyter labextension install @pyviz/jupyterlab_pyviz

       For detailed instructions on displaying plots in the notebook and from the Python command prompt see  the
       hvPlot user guide.

   Python Command Prompt & Scripts
       Assuming  the  US Crime dataset has been installed (in the intake-examples repo, or from conda with conda
       install -c intake us_crime):

        Once installed, the plotting API can be used via the .plot method on an Intake DataSource:

          import intake
          import hvplot as hp

          crime = intake.cat.us_crime
          columns = ['Burglary rate', 'Larceny-theft rate', 'Robbery rate', 'Violent Crime rate']

          violin = crime.plot.violin(y=columns, group_label='Type of crime',
                                     value_label='Rate per 100k', invert=True)
          hp.show(violin)
       [image]

   Notebook
        Inside the notebook plots will display themselves; however, the notebook extension must be loaded first.
        The extension may be loaded by importing the hvplot.intake module, by explicitly loading the holoviews
        extension, or by calling intake.output_notebook():

          # To load the extension run this import
          import hvplot.intake

          # Or load the holoviews extension directly
          import holoviews as hv
          hv.extension('bokeh')

          # convenience function
          import intake
          intake.output_notebook()

          crime = intake.cat.us_crime
          columns = ['Violent Crime rate', 'Robbery rate', 'Burglary rate']
          crime.plot(x='Year', y=columns, value_label='Rate (per 100k people)')

   Predefined Plots
       Some catalogs will define plots appropriate to a specific data source. These will be specified such  that
       the user gets the right view with the right columns and labels, without having to investigate the data in
       detail -- this is ideal for quick-look plotting when browsing sources.

          import intake
           intake.cat.us_crime.plots

       Returns ['example']. This works whether accessing the entry object or the source instance. To visualise

           intake.cat.us_crime.plot.example()

   Persisting metadata
       Intake allows catalog yaml files to declare metadata fields for each data source which are made available
       alongside the actual dataset. The plotting API reserves certain fields to define default plot options, to
       label and annotate the data fields in a dataset and to declare pre-defined plots.

   Declaring defaults
        The first set of metadata used by the plotting API is the plot field in the metadata section. Any options
        found in the plot field will apply to all plots generated from that data source, allowing the
        definition of plotting defaults. For example, when plotting a fairly large dataset such as the NYC Taxi
        data, it might be desirable to enable datashader by default, ensuring that any plot that supports it is
        datashaded. The syntax to declare default plot options is as follows:

          sources:
            nyc_taxi:
              description: NYC Taxi dataset
              driver: parquet
              args:
                urlpath: 's3://datashader-data/nyc_taxi_wide.parq'
              metadata:
                plot:
                  datashade: true

   Declaring data fields
        The columns of a CSV or parquet file, or the coordinates and data variables in a NetCDF file, often have
        shortened or cryptic names with underscores. They also do not provide additional information about the
        units of the data or the range of values; therefore, the catalog yaml specification also provides the
       ability to define additional information about the fields in a dataset.

       Valid attributes that may be defined for the data fields include:

       • label: A readable label for the field which will be used to label axes and widgets

       • unit: A unit associated with the values inside a data field

       • range:  A  range  associated  with a field declaring limits which will override those computed from the
         data

       Just like the default plot options the fields may be declared  under  the  metadata  section  of  a  data
       source:

          sources:
            nyc_taxi:
              description: NYC Taxi dataset
              driver: parquet
              args:
                urlpath: 's3://datashader-data/nyc_taxi_wide.parq'
              metadata:
                fields:
                  dropoff_x:
                    label: Longitude
                  dropoff_y:
                    label: Latitude
                  total_fare:
                    label: Fare
                    unit: $

   Declaring custom plots
       As  shown  in  the  hvPlot  user  guide,  the plotting API provides a variety of plot types, which can be
       declared  using  the  kind  argument  or  via   convenience   methods   on   the   plotting   API,   e.g.
        cat.source.plot.scatter().  In addition to declaring default plot options and field metadata, data sources
        may also declare custom plots, which will be made available as methods on the plotting API. In this way a
        catalog may declare any number of custom plots alongside a data source.

       To  make  this  more  concrete  consider  the following custom plot declaration on the plots field in the
       metadata section:

          sources:
            nyc_taxi:
              description: NYC Taxi dataset
              driver: parquet
              args:
                urlpath: 's3://datashader-data/nyc_taxi_wide.parq'
              metadata:
                plots:
                  dropoff_scatter:
                    kind: scatter
                    x: dropoff_x
                    y: dropoff_y
                    datashade: True
                    width: 800
                    height: 600

       This declarative specification creates a new custom plot called dropoff_scatter, which will be  available
       on  the  catalog  under  cat.nyc_taxi.plot.dropoff_scatter().  Calling  this  method on the plot API will
       automatically generate a datashaded scatter plot of the dropoff locations in the NYC taxi dataset.

       Of course the three metadata fields may also be used together, declaring global defaults under  the  plot
       field, annotations for the data fields under the fields key and custom plots via the plots field.

   Plugin Directory
       This  is  a  list  of  known projects which install driver plugins for Intake, and the named drivers each
       contains in parentheses:

       • builtin  to  Intake  (catalog,   csv,   intake_remote,   ndzarr,   numpy,   textfiles,   yaml_file_cat,
         yaml_files_cat, zarr_cat, json, jsonl)

       • intake-astro Table and array loading of FITS astronomical data (fits_array, fits_table)

       • intake-accumulo Apache Accumulo clustered data storage (accumulo)

       • intake-avro: Apache Avro data serialization format (avro_table, avro_sequence)

       • intake-bluesky: search and retrieve data in the bluesky data model

       • intake-dcat Browse and load data from DCAT catalogs. (dcat)

       • intake-dynamodb link to Amazon DynamoDB (dynamodb)

       • intake-elasticsearch:     Elasticsearch     search    and    analytics    engine    (elasticsearch_seq,
         elasticsearch_table)

        • intake-esm:  Plugin for building and loading intake catalogs for earth system data holdings, such
         as CMIP (Coupled Model Intercomparison Project) and CESM Large Ensemble datasets.

       • intake-geopandas:  load  from  ESRI  Shape  Files,  GeoJSON,  and  geospatial  databases with geopandas
         (geojson, postgis, shapefile, spatialite) and regionmask for opening shapefiles into regionmask.

       • intake-google-analytics:   run   Google   Analytics   queries   and   load   data   as   a    DataFrame
         (google_analytics_query)

       • intake-hbase: Apache HBase database (hbase)

       • intake-iris load netCDF and GRIB files with IRIS (grib, netcdf)

       • intake-metabase:  Generate  catalogs  and  load  tables  as DataFrames from Metabase (metabase_catalog,
         metabase_table)

       • intake-mongo: MongoDB noSQL query (mongo)

       • intake-nested-yaml-catalog: Plugin supporting a single YAML hierarchical catalog to  organize  datasets
         and avoid a data swamp. (nested_yaml_cat)

       • intake-netflow: Netflow packet format (netflow)

       • intake-notebook:  Experimental plugin to access parameterised notebooks through intake and executed via
         papermill (ipynb)

       • intake-odbc: ODBC database (odbc)

       • intake-parquet: Apache Parquet file format (parquet)

       • intake-pattern-catalog: Plugin for specifying a file-path pattern  which  can  represent  a  number  of
         different entries (pattern_cat)

       • intake-pcap: PCAP network packet format (pcap)

       • intake-postgres: PostgreSQL database (postgres)

       • intake-s3-manifests (s3_manifest)

       • intake-salesforce: Generate catalogs and load tables as DataFrames from Salesforce (salesforce_catalog,
         salesforce_table)

       • intake-sklearn: Load scikit-learn models from Pickle files (sklearn)

       • intake-solr: Apache Solr search platform (solr)

       • intake-stac: Intake Driver for SpatioTemporal Asset Catalogs (STAC).

       • intake-stripe:   Generate   catalogs  and  load  tables  as  DataFrames  from  Stripe  (stripe_catalog,
         stripe_table)

       • intake-spark: data processed by Apache Spark (spark_cat, spark_rdd, spark_dataframe)

       • intake-sql: Generic SQL queries via SQLAlchemy (sql_cat, sql, sql_auto, sql_manual)

       • intake-splunk: Splunk machine data query (splunk)

       • intake-streamz: real-time event processing using Streamz (streamz)

       • intake-thredds: Intake interface to THREDDS data catalogs (thredds_cat, thredds_merged_source)

       • intake-xarray: load netCDF, Zarr and other multi-dimensional data (xarray_image, netcdf, grib, opendap,
         rasterio, remote-xarray, zarr)

       The status of these projects is available at Status Dashboard.

       Don't see your favorite format?  See making-plugins for how to create new plugins.

       Note that if you want your plugin listed here, open an issue in the Intake issue repository  and  add  an
       entry  to  the  status  dashboard  repository. We also have a plugin wishlist Github issue that shows the
       breadth of plugins we hope to see for Intake.

   Server Protocol
       This page gives deeper details on how the Intake server is implemented. For those simply wishing  to  run
       and configure a server, see the tools section.

       Communication  between  the  intake  client and server happens exclusively over HTTP, with all parameters
       passed using msgpack UTF8 encoding. The server side  is  implemented  by  the  module  intake.cli.server.
       Currently, only the following two routes are available:

           • http://server:port/v1/info

           • http://server:port/v1/source

       The  server  may  be configured to use auth services, which, when passed the header of the incoming call,
       can determine whether the given request is allowed. See auth-plugins.

   GET /info
       Retrieve information about the data-sets  available  on  this  server.  The  list  of  data-sets  may  be
       paginated,  in  order to avoid excessively long transactions. Notice that the catalog for which a listing
       is being requested can itself be  a  data-source  (when  source-id  is  passed)  -  this  is  how  nested
       sub-catalogs are handled on the server.

    Parameters

        • page_size, int or None (optional): to enable pagination, set this value. The number of entries returned
         will be this value at most. If None, returns all entries. This is passed as a query parameter.

       • page_offset,  int  (optional): when paginating, start the list from this numerical offset. The order of
         entries is guaranteed if the base catalog has not changed. This is passed as a query parameter.

        • source-id, uuid string (optional): when the catalog being accessed is not the root catalog, but an
         open  data-source  on the server, this is its unique identifier. See POST /source for how these IDs are
         generated.  If the catalog being accessed is the root Catalog, this parameter should be  omitted.  This
         is passed as an HTTP header.

    Returns

        • version, string: the server's Intake version

       • sources,  list  of  objects:  the  main  payload,  where each object contains a name, and the result of
         calling .describe() on the corresponding data-source, i.e., the container type, description, metadata.

       • metadata, object: any metadata associated with the whole catalog
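
        As an illustration only (the Intake client normally performs this call for you), a sketch of calling
        this route directly with the requests and msgpack packages, assuming the response body is
        msgpack-encoded as described above:

           import msgpack
           import requests

           r = requests.get('http://localhost:5000/v1/info')
           info = msgpack.unpackb(r.content, raw=False)
           print(info['version'])
           print([s['name'] for s in info['sources']])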

   GET /source
       Fetch information about a specific source. This is the random-access variant of the GET /info  route,  by
       which a particular data-source can be accessed without paginating through all of the sources.

    Parameters

        • name, string (required): the data source name being accessed, one of the members of the catalog. This
         is passed as a query parameter.

       • source-id, uuid string (optional): when the catalog being accessed is not the root catalog, but an open
         data-source on the server, this is its unique identifier. See  POST  /source  for  how  these  IDs  are
         generated.   If  the catalog being accessed is the root Catalog, this parameter should be omitted. This
         is passed as an HTTP header.

   Returns
       Same as one of the entries in sources for GET /info: the result of .describe() on the  given  data-source
       in the server

   POST /source, action="search"
       Searching a Catalog returns search results in the form of a new Catalog. This "results" Catalog is cached
       on the server the same as any other Catalog.

    Parameters

        • source-id, uuid string (optional): when the catalog being searched is not the root catalog, but a
         subcatalog on the server, this is its unique identifier. If the catalog  being  searched  is  the  root
         Catalog, this parameter should be omitted. This is passed as an HTTP header.

       • query:  tuple of (args, kwargs): These will be unpacked into Catalog.search on the server to create the
         "results" Catalog. This is passed in the body of the message.

    Returns

        • source_id, uuid string: the identifier of the results Catalog in the server's source cache

   POST /source, action="open"
       This is a more involved processing of a data-source, and, if successful,  returns  one  of  two  possible
       scenarios:

       • direct-access,  in  which  all  the  details required for reading the data directly from the client are
         passed, and the client then creates a local copy of the data source and needs  no  further  involvement
         from the server in order to fetch the data

       • remote-access, in which the client is unable or unwilling to create a local version of the data-source,
          and instead creates a remote data-source which will fetch the data for each partition from the server.

       The  set  of  parameters  supplied  and  the server/client policies will define which method of access is
       employed. In the case of remote-access, the data source is instantiated on the  server,  and  .discover()
       run  on  it.  The  resulting  information is passed back, and must be enough to instantiate a subclass of
       intake.container.base.RemoteSource appropriate for the container  of  the  data-set  in  question  (e.g.,
       RemoteArray  when  container="ndarray").   In this case, the response also includes a UUID string for the
       open instance on the server, referencing the cache of open sources maintained by the server.

       Note that "opening" a data entry which is itself is a catalog implies instantiating that  catalog  object
       on the server and returning its UUID, such that a listing can be made using GET/ info or GET /source.

    Parameters
        • name, string (required): the data source name being accessed, one of the members of the catalog. This
          is passed in the body of the request.

       • source-id, uuid string (optional): when the catalog being accessed is not the root catalog, but an open
         data-source on the server, this is its unique identifier. If the catalog being  accessed  is  the  root
         Catalog, this parameter should be omitted. This is passed as an HTTP header.

       • available_plugins, list of string (optional): the set of named data drivers supported by the client. If
         the  driver  required by the data-source is not supported by the client, then the source must be opened
         remote-access. This is passed in the body of the request.

       • parameters, object (optional): user parameters to pass to the data-source when  instantiating.  Whether
         or  not  direct-access is possible may, in principle, depend on these parameters, but this is unlikely.
         Note that some parameter default value functions are designed to be evaluated on the server, which  may
         have  access  to,  for example, some credentials service (see paramdefs). This is passed in the body of
         the request.

   Returns
        If direct-access, the driver plugin name and set of arguments for instantiating the data-source in the
        client.

       If  remote-access, the data-source container, schema and source-ID so that further reads can be made from
       the server.

   POST /source, action="read"
       This route fetches data from the server once a data-source has been opened in remote-access mode.

    Parameters
        • source-id, uuid string (required): the identifier of the data-source in the server's source cache. This
          is returned when action="open". This is passed in the body of the request.

        • partition, int or tuple (optional, but necessary for some sources): section/chunk of the data to fetch.
          In cases where the data-source is partitioned, the client will fetch the data one partition at a time,
          so that it will appear partitioned in the same manner on the client side for iteration or passing to
          Dask. Some data-sources do not support partitioning, in which case this parameter is not required and
          is ignored. This is passed in the body of the request.

       • accepted_formats,  accepted_compression,  list  of  strings (required): to specify how serialization of
         data happens. This is an expert feature, see docs in the module  intake.container.serializer.  This  is
         passed in the body of the request.
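
        Putting these routes together, a minimal HTTP client sketch is shown below. It is an illustration only:
        the base URL is hypothetical, and the request bodies are shown as JSON for readability, whereas the real
        Intake client may encode them differently (see intake.container.serializer for the serialization of data
        payloads). What it demonstrates is where each parameter described above travels: query string, HTTP
        header, or request body.

           import requests

           base = "http://localhost:5000"   # hypothetical Intake server address

           # GET /info: list the entries of the root catalog, with pagination query parameters
           info = requests.get(base + "/info", params={"page_size": 10, "page_offset": 0})

           # GET /source: describe one named entry of an already-opened sub-catalog;
           # the sub-catalog's UUID travels in the "source-id" HTTP header
           desc = requests.get(base + "/source", params={"name": "mysource"},
                               headers={"source-id": "0123-abcd"})

           # POST /source, action="open": name, user parameters and the client's driver list go in
           # the request body; the response states whether access is direct or remote
           opened = requests.post(base + "/source",
                                  json={"action": "open", "name": "mysource",
                                        "parameters": {}, "available_plugins": ["csv"]})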

   Dataset Transforms
        a.k.a. derived datasets.

       WARNING:
          experimental  feature,  the  API  may  change.  The  data sources in intake.source.derived are not yet
          declared as top-level named drivers in the package entrypoints.

        Intake allows for the definition of data sources which take as their input another source in the same
        catalog, giving you the opportunity to present processed views of the data to the user of the catalog.

       The  "target" or a derived data source will normally be a string. In the simple case, it is the name of a
       data source in the same catalog. However, we use the syntax "catalog:source" to refer to sources in other
       catalogs, where the part before ":" will be passed to intake.open_catalog(), together  with  any  keyword
       arguments from cat_kwargs.

       This can be done by defining classes which inherit from intake.source.derived.DerivedSource, or using one
       of  the pre-defined classes in the same module, which usually need to be passed a reference to a function
       in a python module. We will demonstrate both.

   Example
       Consider the following target dataset, which loads some simple facts about US states  from  a  CSV  file.
       This  example is taken from the Intake test suite.
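
        The catalog entry for such a target might look roughly like the following sketch (the file name and
        description are illustrative; what matters is that the derived sources below refer to it by its entry
        name, input_data):

           input_data:
             description: Facts about US states
             driver: csv
             args:
               urlpath: "{{ CATALOG_DIR }}us_states.csv"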

       We now show two ways to apply a super-simple transform to this data, which selects two of the dataframe's
       columns.

   Class Example
        The first version uses an approach in which the transform is defined in a data source class, and the
        parameters passed are specific to the transform type.  Note that the driver is referred to by its
        fully-qualified name in the Intake package.
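
        For illustration, a catalog entry for this class-based transform could be written as follows (the entry
        name derive_cols and the column names match the discussion below; the exact YAML in the test suite may
        differ slightly):

           derive_cols:
             driver: intake.source.derived.Columns
             args:
               targets:
                 - input_data
               columns: ["state", "slug"]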

       The source class for this is included in the Intake codebase, but the important part is:

          class Columns(DataFrameTransform):
              ...

              def pick_columns(self, df):
                  return df[self._params["columns"]]

        We see that this specific class inherits from DataFrameTransform, with transform=self.pick_columns. We
        know that the inputs and outputs are both dataframes. This allows for some additional validation and an
        automated way to infer the output dataframe's schema, reducing the number of lines of code required.

        The given method does exactly what you might imagine: it takes an input dataframe and applies a column
        selection to it.

       Running cat.derive_cols.read() will indeed, as expected, produce a version of  the  data  with  only  the
        selected columns included. It does this by defining the original dataset, applying the selection, and then
       getting  Dask  to generate the output. For some datasets, this can mean that the selection is pushed down
       to the reader, and the data for the dropped columns is never loaded. The user may choose to do .to_dask()
       instead, and manipulate the lazy dataframe directly, before loading.
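
        A minimal usage sketch, assuming the entries above live in a file called catalog.yaml:

           import intake

           cat = intake.open_catalog("catalog.yaml")
           df = cat.derive_cols.read()       # pandas dataframe with only the selected columns
           ddf = cat.derive_cols.to_dask()   # or keep it lazy and manipulate before loading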

   Functional Example
       This   second   version   of   the    same    output    uses    the    more    generic    and    flexible
       intake.source.derived.DataFrameTransform.

          derive_cols_func:
            driver: intake.source.derived.DataFrameTransform
            args:
              targets:
                - input_data
              transform: "intake.source.tests.test_derived._pick_columns"
              transform_kwargs:
                columns: ["state", "slug"]

       In this case, we pass a reference to a function defined in the Intake test suite.  Normally this would be
       declared  in  user modules, where perhaps those declarations and catalog(s) are distributed together as a
       package.

          def _pick_columns(df, columns):
              return df[columns]

        This is, of course, very similar to the method shown in the previous section, and again applies the
        selection in the given named argument to the input. Note that Intake does not support including actual
        code in your catalog: we would not want arbitrary code to be executed simply on loading a catalog, as
        opposed to when a data source is explicitly run.

       Loading  this data source proceeds exactly the same way as the class-based approach, above. Both Dask and
       in-memory (Pandas, via .read()) methods work as expected.  The declaration in YAML,  above,  is  slightly
       more  verbose,  but  the  amount of code is smaller. This demonstrates a tradeoff between flexibility and
       concision. If there were validation code to add for the arguments or input  dataset,  it  would  be  less
       obvious where to put these things.

   Barebone Example
        The previous two examples both performed dataframe-to-dataframe transforms. However, totally arbitrary
        computations are possible. Consider the following:

          barebones:
            driver: intake.source.derived.GenericTransform
            args:
              targets:
                - input_data
              transform: builtins.len
              transform_kwargs: {}

        This applies len to the input dataframe. cat.barebones.describe() gives the output container type as
        "other", i.e., not specified. The result of read() on this gives the single number 50, the number of rows
        in the input data. This class and DerivedSource are intended primarily as superclasses to derive from,
        and will probably not often be used directly.

   Execution engine
        None of the above examples specified explicitly where the compute implied by the transformation will take
        place. Most Intake drivers support in-memory containers and Dask, and the input dataset here is a
        dataframe. The behaviour, however, is defined in the driver class itself - so it would be fine to write a
        driver which makes different assumptions. Suppose, for instance, that the original source is to be loaded
        with Spark (see the intake-spark package): the driver could explicitly call .to_spark on the original
        source and be assured that it has a Spark object to work with. It should, of course, explain in its
        documentation what assumptions are being made and that, presumably, the user is expected to also call
        .to_spark if they wish to manipulate the Spark object directly.

   API
              ┌───────────────────────────────────────────────┬───────────────────────────────────────┐
              │ intake.source.derived.DerivedSource(*args,    │ Base  source  deriving  from  another │
              │ ...)                                          │ source in the same catalog            │
              ├───────────────────────────────────────────────┼───────────────────────────────────────┤
              │ intake.source.derived.GenericTransform(...)   │                                       │
              ├───────────────────────────────────────────────┼───────────────────────────────────────┤
              │ intake.source.derived.DataFrameTransform(...) │ Transform where the input and  output │
              │                                               │ are both Dask-compatible dataframes   │
              ├───────────────────────────────────────────────┼───────────────────────────────────────┤
              │ intake.source.derived.Columns(*args,          │ Simple  dataframe  transform  to pick │
              │ **kwargs)                                     │ columns                               │
              └───────────────────────────────────────────────┴───────────────────────────────────────┘

       class intake.source.derived.DerivedSource(*args, **kwargs)
              Base source deriving from another source in the same catalog

              Target picking and parameter validation are performed here, but you probably want to subclass from
              one of the more specific classes like DataFrameTransform.

              __init__(targets, target_chooser=<function first>, target_kwargs=None, cat_kwargs=None,
              container=None, metadata=None, **kwargs)

                     Parameters

                            targets: list of string or DataSources
                                   If string(s), refer to entries of the same catalog as this Source

                            target_chooser: function to choose between targets
                                   function(targets, cat) -> source, or a fully-qualified dotted string pointing
                                   to it

                            target_kwargs: dict of dict with keys matching items of targets

                            cat_kwargs: to pass to intake.open_catalog, if the target is in
                                   another catalog

                            container: str (optional)
                                   Assumed output container, if known/different from input

                             [Note: the exact form of target_kwargs and cat_kwargs may be subject to change]

       class intake.source.derived.GenericTransform(*args, **kwargs)

              __init__(targets, target_chooser=<function first>, target_kwargs=None, cat_kwargs=None,
              container=None, metadata=None, **kwargs)

                     Parameters

                            targets: list of string or DataSources
                                   If string(s), refer to entries of the same catalog as this Source

                            target_chooser: function to choose between targets
                                   function(targets, cat) -> source, or a fully-qualified dotted string pointing
                                   to it

                            target_kwargs: dict of dict with keys matching items of targets

                            cat_kwargs: to pass to intake.open_catalog, if the target is in
                                   another catalog

                            container: str (optional)
                                   Assumed output container, if known/different from input

                             [Note: the exact form of target_kwargs and cat_kwargs may be subject to change]

       class intake.source.derived.DataFrameTransform(*args, **kwargs)
              Transform where the input and output are both Dask-compatible dataframes

               This derives from GenericTransform, and you must supply transform and any transform_kwargs.

              __init__(targets, target_chooser=<function first>, target_kwargs=None, cat_kwargs=None,
              container=None, metadata=None, **kwargs)

                     Parameters

                            targets: list of string or DataSources
                                   If string(s), refer to entries of the same catalog as this Source

                            target_chooser: function to choose between targets
                                   function(targets, cat) -> source, or a fully-qualified dotted string pointing
                                   to it

                            target_kwargs: dict of dict with keys matching items of targets

                            cat_kwargs: to pass to intake.open_catalog, if the target is in
                                   another catalog

                            container: str (optional)
                                   Assumed output container, if known/different from input

                             [Note: the exact form of target_kwargs and cat_kwargs may be subject to change]

       class intake.source.derived.Columns(*args, **kwargs)
              Simple dataframe transform to pick columns

              Given as an example of how to make a specific  dataframe  transform.   Note  that  you  could  use
              DataFrameTransform  directly,  by  writing a function to choose the columns instead of a method as
              here.

              __init__(columns, **kwargs)

                     columns: list of labels (usually str) or slice
                            Columns to choose from the target dataframe

REFERENCE

   API
       Auto-generated reference

   End User
       These are reference class and function definitions likely to be useful to everyone.
             ┌──────────────────────────────────────────────────┬───────────────────────────────────────┐
             │ intake.open_catalog([uri])                       │ Create a Catalog object               │
             ├──────────────────────────────────────────────────┼───────────────────────────────────────┤
             │ intake.registry                                  │                                       │
             ├──────────────────────────────────────────────────┼───────────────────────────────────────┤
             │ intake.register_driver(name, driver[,            │ Add a driver to intake.registry.      │
             │ overwrite])                                      │                                       │
             ├──────────────────────────────────────────────────┼───────────────────────────────────────┤
             │ intake.unregister_driver(name)                   │ Ensure  that  a  given  name  in  the │
             │                                                  │ registry is cleared.                  │
             ├──────────────────────────────────────────────────┼───────────────────────────────────────┤
             │ intake.upload(data, path, **kwargs)              │ Given  a  concrete data object, store │
             │                                                  │ it at given location return Source    │
             ├──────────────────────────────────────────────────┼───────────────────────────────────────┤
             │ intake.source.csv.CSVSource(*args,               │ Read CSV files into dataframes        │
             │ **kwargs)                                        │                                       │
             ├──────────────────────────────────────────────────┼───────────────────────────────────────┤
             │ intake.source.textfiles.TextFilesSource(...)     │ Read textfiles as sequence of lines   │
             ├──────────────────────────────────────────────────┼───────────────────────────────────────┤
             │ intake.source.jsonfiles.JSONFileSource(...)      │ Read   JSON   files   as   a   single │
             │                                                  │ dictionary or list                    │
             ├──────────────────────────────────────────────────┼───────────────────────────────────────┤
             │ intake.source.jsonfiles.JSONLinesFileSource(...) │ Read a JSONL (https://jsonlines.org/) │
             │                                                  │ file  and  return  a list of objects, │
             │                                                  │ each being valid json object (e.g.    │
             ├──────────────────────────────────────────────────┼───────────────────────────────────────┤
             │ intake.source.npy.NPySource(*args, **kwargs)     │ Read numpy binary files into an array │
             ├──────────────────────────────────────────────────┼───────────────────────────────────────┤
             │ intake.source.zarr.ZarrArraySource(*args, ...)   │ Read Zarr format files into an array  │
             ├──────────────────────────────────────────────────┼───────────────────────────────────────┤
             │ intake.catalog.local.YAMLFileCatalog(*args, ...) │ Catalog as described by a single YAML │
             │                                                  │ file                                  │
             ├──────────────────────────────────────────────────┼───────────────────────────────────────┤
             │ intake.catalog.local.YAMLFilesCatalog(*args,     │ Catalog as described  by  a  multiple │
             │ ...)                                             │ YAML files                            │
             ├──────────────────────────────────────────────────┼───────────────────────────────────────┤
             │ intake.catalog.zarr.ZarrGroupCatalog(*args, ...) │ A  catalog  of  the members of a Zarr │
             │                                                  │ group.                                │
             └──────────────────────────────────────────────────┴───────────────────────────────────────┘

       intake.open_catalog(uri=None, **kwargs)
              Create a Catalog object

              Can load YAML catalog files, connect to an intake server, or create any arbitrary Catalog subclass
              instance. In the general case, the user should supply  driver=  with  a  value  from  the  plugins
              registry  which  has  a  container  type  of  catalog.  File locations can generally be remote, if
              specifying a URL protocol.

              The default behaviour if not specifying the driver is as follows:

               • if uri is a single string ending in "yml" or "yaml", open it as a catalog file

              • if uri is a list of strings, a string containing a glob character ("*") or a string  not  ending
                in "y(a)ml", open as a set of catalog files. In the latter case, assume it is a directory.

               • if uri begins with protocol "intake:", connect to a remote Intake server

              • if uri is None or missing, create a base Catalog object without entries.

              Parameters

                     uri: str or pathlib.Path
                            Designator for the location of the catalog.

                     kwargs:
                            passed  to  subclass  instance, see documentation of the individual catalog classes.
                            For example, yaml_files_cat (when specifying multiple uris or a glob  string)  takes
                            the additional parameter flatten=True|False, specifying whether all data sources are
                            merged in a single namespace, or each file becomes a sub-catalog.

              SEE ALSO:

                 intake.open_yaml_files_cat, intake.open_yaml_file_cat

                 intake.open_intake_remote
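
               For example (all of the locations are illustrative):

                  import intake

                  cat = intake.open_catalog("catalog.yml")               # single YAML catalog file
                  cats = intake.open_catalog("cats/*.yaml")              # glob: a set of catalog files
                  remote = intake.open_catalog("intake://server:5000")   # connect to an Intake server
                  empty = intake.open_catalog()                          # base Catalog with no entries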

       intake.registry
              Mapping  from plugin names to the DataSource classes that implement them. These are the names that
              should appear in the driver: key of each source definition in a catalog. See plugin-directory  for
              more details.

       intake.open_
              Set  of functions, one for each plugin, for direct opening of a data source. The names are derived
              from the names of the plugins in the registry at import time.

       intake.upload(data, path, **kwargs)
               Given a concrete data object, store it at the given location and return a Source

              Use this function to publicly share data which you have created in  your  python  session.  Intake
              will  try  each of the container types, to see if one of them can handle the input data, and write
              the data to the path given, in the format most appropriate for the data type,  e.g.,  parquet  for
              pandas or dask data-frames.

              With  the  DataSource  instance  you get back, you can add this to a catalog, or just get the YAML
              representation for editing (.yaml()) and sharing.

              Parameters

                      data   instance The object to upload and store. In  many  cases,  the  dask  or  in-memory
                             variants are handled equivalently.

                     path   str  Location of the output files; can be, for instance, a network drive for sharing
                            over a VPC, or a bucket on a cloud storage service

                      kwargs passed to the writer for fine control.

                     Returns

                            DataSource instance
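
               For example (the output location is illustrative):

                  import intake
                  import pandas as pd

                  df = pd.DataFrame({"state": ["IA", "NY"], "population": [3.2, 19.5]})
                  source = intake.upload(df, "s3://mybucket/states/")   # parquet for (dask) dataframes
                  print(source.yaml())   # YAML block ready to paste into a catalog file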

   Source classes
       class intake.source.csv.CSVSource(*args, **kwargs)
              Read CSV files into dataframes

              Prototype of sources reading dataframe data

              __init__(urlpath, csv_kwargs=None, metadata=None, storage_options=None, path_as_pattern=True)

                     Parameters

                            urlpath
                                   str or iterable, location of data May be a local  path,  or  remote  path  if
                                   including a protocol specifier such as 's3://'. May include glob wildcards or
                                   format pattern strings.  Some examples:

                                    • {{ CATALOG_DIR }}data/precipitation.csv

                                    • s3://data/*.csv

                                    • s3://data/precipitation_{state}_{zip}.csv

                                    • s3://data/{year}/{month}/{day}/precipitation.csv

                                    • {{ CATALOG_DIR }}data/precipitation_{date:%Y-%m-%d}.csv

                            csv_kwargs
                                   dict Any further arguments to pass to Dask's read_csv (such as block size) or
                                   to  the  CSV  parser  in  pandas  (such  as  which  columns to use, encoding,
                                   data-types)

                            storage_options
                                   dict Any parameters that need to be passed to the remote data  backend,  such
                                   as credentials.

                            path_as_pattern
                                   bool  or  str,  optional  Whether  to  treat  the  path  as  a  pattern  (ie.
                                   data_{field}.csv) and create new  columns  in  the  output  corresponding  to
                                   pattern fields. If str, is treated as pattern to match on. Default is True.

              discover()
                     Open resource and populate the source attributes.

              export(path, **kwargs)
                     Save this data for sharing with other people

                     Creates  a  copy  of  the  data  in a format appropriate for its container, in the location
                     specified (which can be remote, e.g., s3).

                     Returns the resultant source object, so that you can, for instance, add  it  to  a  catalog
                     (catalog.add(source)) or get its YAML representation (.yaml()).

              persist(ttl=None, **kwargs)
                     Save data from this source to local persistent storage

                     Parameters

                            ttl: numeric, optional
                                   Time  to  live  in seconds. If provided, the original source will be accessed
                                   and a new persisted version written transparently when more than ttl  seconds
                                   have passed since the old persisted version was written.

                             kwargs: passed to the _persist method on the base container.

              read() Load entire dataset into a container and return it

              read_partition(i)
                     Return a part of the data corresponding to i-th partition.

                     By  default, assumes i should be an integer between zero and npartitions; override for more
                     complex indexing schemes.

              to_dask()
                     Return a dask container for this data source
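
               A brief usage sketch (the path and pattern are illustrative; open_csv is the convenience function
               generated for the csv driver):

                  import intake

                  source = intake.open_csv("s3://data/precipitation_{state}.csv")
                  source.discover()        # populate schema attributes such as dtype and npartitions
                  df = source.read()       # pandas dataframe; a "state" column is filled from the pattern
                  ddf = source.to_dask()   # or a lazy dask dataframe, one partition per file/block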

       class intake.source.zarr.ZarrArraySource(*args, **kwargs)
              Read Zarr format files into an array

               Zarr is a numerical array storage format which works particularly well with remote  and  parallel
              access.  For specifics of the format, see https://zarr.readthedocs.io/en/stable/

              __init__(urlpath, storage_options=None, component=None, metadata=None, **kwargs)
                     The parameters dtype and shape will be determined from the first file, if not given.

                     Parameters

                            urlpath
                                   str Location of data file(s), possibly including protocol information

                            storage_options
                                   dict Passed on to storage backend for remote files

                            component
                                   str  or None If None, assume the URL points to an array. If given, assume the
                                   URL points to a group, and descend the  group  to  find  the  array  at  this
                                   location in the hierarchy.

                             kwargs passed on to dask.array.from_zarr.

               discover()
                      Open resource and populate the source attributes.

               export(path, **kwargs)
                      Save this data for sharing with other people

                      Creates a copy of the data in a format appropriate  for  its  container,  in  the  location
                      specified (which can be remote, e.g., s3).

                      Returns  the  resultant  source  object, so that you can, for instance, add it to a catalog
                      (catalog.add(source)) or get its YAML representation (.yaml()).

               persist(ttl=None, **kwargs)
                      Save data from this source to local persistent storage

                      Parameters

                             ttl: numeric, optional
                                    Time to live in seconds. If provided, the original source  will  be  accessed
                                    and  a new persisted version written transparently when more than ttl seconds
                                    have passed since the old persisted version was written.

                             kwargs: passed to the _persist method on the base container.

               read() Load entire dataset into a container and return it

               read_partition(i)
                      Return a part of the data corresponding to i-th partition.

                      By default, assumes i should be an integer between zero and npartitions; override for  more
                      complex indexing schemes.

               to_dask()
                      Return a dask container for this data source
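
               A brief sketch (the store URL and component name are illustrative):

                  from intake.source.zarr import ZarrArraySource

                  source = ZarrArraySource("s3://bucket/store.zarr", component="temperature")
                  arr = source.to_dask()   # dask array backed by the zarr store
                  data = source.read()     # or load the whole array into memory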

       class intake.source.textfiles.TextFilesSource(*args, **kwargs)
              Read textfiles as sequence of lines

              Prototype of sources reading sequential data.

              Takes  a  set  of  files, and returns an iterator over the text in each of them.  The files can be
              local or remote. Extra parameters for encoding, etc., go into storage_options.

              __init__(urlpath, text_mode=True, text_encoding='utf8', compression=None, decoder=None, read=True,
              metadata=None, storage_options=None)

                     Parameters

                            urlpath
                                   str or list(str) Target files. Can be a  glob-path  (with  "*")  and  include
                                   protocol specified (e.g., "s3://"). Can also be a list of absolute paths.

                            text_mode
                                   bool Whether to open the file in text mode, recoding binary characters on the
                                   fly

                            text_encoding
                                   str If text_mode is True, apply this encoding. UTF* is by far the most common

                            compression
                                   str  or  None If given, decompress the file with the given codec on load. Can
                                   be something like "gzip", "bz2", or  to  try  to  guess  from  the  filename,
                                   'infer'

                            decoder
                                   function,  str or None Use this to decode the contents of files. If None, you
                                   will get a list of lines of text/bytes. If a function, it must operate on  an
                                   open file-like object or a bytes/str instance, and return a list

                            read   bool  If decoder is not None, this flag controls whether bytes/str get passed
                                   to the function indicated (True) or the open file-like object (False)

                            storage_options: dict
                                   Options to pass to the file reader backend, including text-specific  encoding
                                   arguments,  and  parameters  specific  to  the  remote file-system driver, if
                                   using.

              discover()
                     Open resource and populate the source attributes.

              export(path, **kwargs)
                     Save this data for sharing with other people

                     Creates a copy of the data in a format appropriate  for  its  container,  in  the  location
                     specified (which can be remote, e.g., s3).

                     Returns  the  resultant  source  object, so that you can, for instance, add it to a catalog
                     (catalog.add(source)) or get its YAML representation (.yaml()).

              persist(ttl=None, **kwargs)
                     Save data from this source to local persistent storage

                     Parameters

                            ttl: numeric, optional
                                   Time to live in seconds. If provided, the original source  will  be  accessed
                                   and  a new persisted version written transparently when more than ttl seconds
                                   have passed since the old persisted version was written.

                             kwargs: passed to the _persist method on the base container.

              read() Load entire dataset into a container and return it

              read_partition(i)
                     Return a part of the data corresponding to i-th partition.

                     By default, assumes i should be an integer between zero and npartitions; override for  more
                     complex indexing schemes.

              to_dask()
                     Return a dask container for this data source
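
               A brief sketch (the path is illustrative):

                  from intake.source.textfiles import TextFilesSource

                  source = TextFilesSource("s3://bucket/logs/*.txt")
                  lines = source.read()             # text content of the files as a Python list
                  first = source.read_partition(0)  # just the first partition (typically one file)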

       class intake.source.jsonfiles.JSONFileSource(*args, **kwargs)
              Read JSON files as a single dictionary or list

              The files can be local or remote. Extra parameters for encoding, etc., go into storage_options.

              __init__(urlpath: str, text_mode: bool = True, text_encoding: str = 'utf8', compression:
              Optional[str] = None, read: bool = True, metadata: Optional[dict] = None, storage_options:
              Optional[dict] = None)

                     Parameters

                            urlpath
                                   str Target file. Can include protocol specified (e.g., "s3://").

                            text_mode
                                   bool Whether to open the file in text mode, recoding binary characters on the
                                   fly

                            text_encoding
                                   str If text_mode is True, apply this encoding. UTF* is by far the most common

                            compression
                                   str  or  None If given, decompress the file with the given codec on load. Can
                                   be something like "zip", "gzip", "bz2", or to try to guess from the filename,
                                   'infer'

                            storage_options: dict
                                   Options to pass to the file reader backend, including text-specific  encoding
                                   arguments,  and  parameters  specific  to  the  remote file-system driver, if
                                   using.

              discover()
                     Open resource and populate the source attributes.

              export(path, **kwargs)
                     Save this data for sharing with other people

                     Creates a copy of the data in a format appropriate  for  its  container,  in  the  location
                     specified (which can be remote, e.g., s3).

                     Returns  the  resultant  source  object, so that you can, for instance, add it to a catalog
                     (catalog.add(source)) or get its YAML representation (.yaml()).

              persist(ttl=None, **kwargs)
                     Save data from this source to local persistent storage

                     Parameters

                            ttl: numeric, optional
                                   Time to live in seconds. If provided, the original source  will  be  accessed
                                   and  a new persisted version written transparently when more than ttl seconds
                                   have passed since the old persisted version was written.

                             kwargs: passed to the _persist method on the base container.

              read() Load entire dataset into a container and return it

       class intake.source.jsonfiles.JSONLinesFileSource(*args, **kwargs)
              Read a JSONL (https://jsonlines.org/) file and return a list of objects,  each  being  valid  json
              object (e.g. a dictionary or list)

              __init__(urlpath: str, text_mode: bool = True, text_encoding: str = 'utf8', compression:
              Optional[str] = None, read: bool = True, metadata: Optional[dict] = None, storage_options:
              Optional[dict] = None)

                     Parameters

                            urlpath
                                   str Target file. Can include protocol specified (e.g., "s3://").

                            text_mode
                                   bool Whether to open the file in text mode, recoding binary characters on the
                                   fly

                            text_encoding
                                   str If text_mode is True, apply this encoding. UTF* is by far the most common

                            compression
                                   str  or  None If given, decompress the file with the given codec on load. Can
                                   be something like "zip", "gzip", "bz2", or to try to guess from the filename,
                                   'infer'.

                            storage_options: dict
                                   Options to pass to the file reader backend, including text-specific  encoding
                                   arguments,  and  parameters  specific  to  the  remote file-system driver, if
                                   using.

              discover()
                     Open resource and populate the source attributes.

              export(path, **kwargs)
                     Save this data for sharing with other people

                     Creates a copy of the data in a format appropriate  for  its  container,  in  the  location
                     specified (which can be remote, e.g., s3).

                     Returns  the  resultant  source  object, so that you can, for instance, add it to a catalog
                     (catalog.add(source)) or get its YAML representation (.yaml()).

              head(nrows: int = 100)
                     return the first nrows lines from the file

              persist(ttl=None, **kwargs)
                     Save data from this source to local persistent storage

                     Parameters

                            ttl: numeric, optional
                                   Time to live in seconds. If provided, the original source  will  be  accessed
                                   and  a new persisted version written transparently when more than ttl seconds
                                   have passed since the old persisted version was written.

                             kwargs: passed to the _persist method on the base container.

              read() Load entire dataset into a container and return it
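
               A brief sketch (the path is illustrative):

                  from intake.source.jsonfiles import JSONLinesFileSource

                  source = JSONLinesFileSource("data/records.jsonl", compression="infer")
                  sample = source.head(10)   # first 10 entries of the file (see head() above)
                  objs = source.read()       # the whole file as a list of JSON-decoded objects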

       class intake.source.npy.NPySource(*args, **kwargs)
              Read numpy binary files into an array

              Prototype source showing example of working with arrays

              Each file becomes one or more partitions, but partitioning within a file is only along the largest
              dimension, to ensure contiguous data.

              __init__(path, dtype=None, shape=None, chunks=None, storage_options=None, metadata=None)
                     The parameters dtype and shape will be determined from the first file, if not given.

                     Parameters

                             path: str or list of str
                                    Location of data file(s), possibly including glob and protocol information

                             dtype: str dtype spec
                                    If known, the dtype (e.g., "int64" or "f4").

                            shape: tuple of int
                                   If known, the length of each axis

                            chunks: int
                                   Size of chunks within a file along biggest dimension - need not be  an  exact
                                   factor of the length of that dimension

                            storage_options: dict
                                   Passed to file-system backend.

              discover()
                     Open resource and populate the source attributes.

              export(path, **kwargs)
                     Save this data for sharing with other people

                     Creates  a  copy  of  the  data  in a format appropriate for its container, in the location
                     specified (which can be remote, e.g., s3).

                     Returns the resultant source object, so that you can, for instance, add  it  to  a  catalog
                     (catalog.add(source)) or get its YAML representation (.yaml()).

              persist(ttl=None, **kwargs)
                     Save data from this source to local persistent storage

                     Parameters

                            ttl: numeric, optional
                                   Time  to  live  in seconds. If provided, the original source will be accessed
                                   and a new persisted version written transparently when more than ttl  seconds
                                   have passed since the old persisted version was written.

                             kwargs: passed to the _persist method on the base container.

              read() Load entire dataset into a container and return it

              read_partition(i)
                     Return a part of the data corresponding to i-th partition.

                     By  default, assumes i should be an integer between zero and npartitions; override for more
                     complex indexing schemes.

              to_dask()
                     Return a dask container for this data source

       class intake.catalog.local.YAMLFileCatalog(*args, **kwargs)
              Catalog as described by a single YAML file

              __init__(path, autoreload=True, **kwargs)

                     Parameters

                            path: str
                                   Location of the file to parse (can be remote)

                             autoreload
                                    bool Whether to watch the source file for changes; set to False if  you  want
                                    an editable Catalog

              export(path, **kwargs)
                     Save this data for sharing with other people

                     Creates  a  copy  of  the  data  in a format appropriate for its container, in the location
                     specified (which can be remote, e.g., s3).

                     Returns the resultant source object, so that you can, for instance, add  it  to  a  catalog
                     (catalog.add(source)) or get its YAML representation (.yaml()).

              persist(ttl=None, **kwargs)
                     Save data from this source to local persistent storage

                     Parameters

                            ttl: numeric, optional
                                   Time  to  live  in seconds. If provided, the original source will be accessed
                                   and a new persisted version written transparently when more than ttl  seconds
                                   have passed since the old persisted version was written.

                             kwargs: passed to the _persist method on the base container.

              reload()
                     Reload catalog if sufficient time has passed

              walk(sofar=None, prefix=None, depth=2)
                     Get all entries in this catalog and sub-catalogs

                     Parameters

                            sofar: dict or None
                                   Within recursion, use this dict for output

                            prefix: list of str or None
                                   Names of levels already visited

                            depth: int
                                   Number  of  levels to descend; needed to truncate circular references and for
                                   cleaner output

                     Returns

                            Dict where the keys are the entry names in dotted syntax, and the

                            values are entry instances.

       class intake.catalog.local.YAMLFilesCatalog(*args, **kwargs)
               Catalog as described by multiple YAML files

              __init__(path, flatten=True, **kwargs)

                     Parameters

                            path: str
                                   Location of the files to parse (can be remote), including possible  glob  (*)
                                   character(s). Can also be list of paths, without glob characters.

                            flatten: bool (True)
                                   Whether  to  list  all  entries in the cats at the top level (True) or create
                                   sub-cats from each file (False).

              export(path, **kwargs)
                     Save this data for sharing with other people

                     Creates a copy of the data in a format appropriate  for  its  container,  in  the  location
                     specified (which can be remote, e.g., s3).

                     Returns  the  resultant  source  object, so that you can, for instance, add it to a catalog
                     (catalog.add(source)) or get its YAML representation (.yaml()).

              persist(ttl=None, **kwargs)
                     Save data from this source to local persistent storage

                     Parameters

                            ttl: numeric, optional
                                   Time to live in seconds. If provided, the original source  will  be  accessed
                                   and  a new persisted version written transparently when more than ttl seconds
                                   have passed since the old persisted version was written.

                             kwargs: passed to the _persist method on the base container.

              reload()
                     Reload catalog if sufficient time has passed

              walk(sofar=None, prefix=None, depth=2)
                     Get all entries in this catalog and sub-catalogs

                     Parameters

                            sofar: dict or None
                                   Within recursion, use this dict for output

                            prefix: list of str or None
                                   Names of levels already visited

                            depth: int
                                   Number of levels to descend; needed to truncate circular references  and  for
                                   cleaner output

                     Returns

                            Dict where the keys are the entry names in dotted syntax, and the

                            values are entry instances.

       class intake.catalog.zarr.ZarrGroupCatalog(*args, **kwargs)
              A catalog of the members of a Zarr group.

              __init__(urlpath, storage_options=None, component=None, metadata=None, consolidated=False,
              name=None)

                     Parameters

                            urlpath
                                   str Location of data file(s), possibly including protocol information

                            storage_options
                                   dict, optional Passed on to storage backend for remote files

                            component
                                   str,  optional  If None, build a catalog from the root group. If given, build
                                   the catalog from the group at this location in the hierarchy.

                            metadata
                                   dict, optional Catalog metadata. If not provided, will be populated from Zarr
                                   group attributes.

                            consolidated
                                   bool, optional If True, assume Zarr metadata has been consolidated.

              export(path, **kwargs)
                     Save this data for sharing with other people

                     Creates a copy of the data in a format appropriate  for  its  container,  in  the  location
                     specified (which can be remote, e.g., s3).

                     Returns  the  resultant  source  object, so that you can, for instance, add it to a catalog
                     (catalog.add(source)) or get its YAML representation (.yaml()).

              persist(ttl=None, **kwargs)
                     Save data from this source to local persistent storage

                     Parameters

                            ttl: numeric, optional
                                   Time to live in seconds. If provided, the original source  will  be  accessed
                                   and  a new persisted version written transparently when more than ttl seconds
                                   have passed since the old persisted version was written.

                             kwargs: passed to the _persist method on the base container.

              reload()
                     Reload catalog if sufficient time has passed

              walk(sofar=None, prefix=None, depth=2)
                     Get all entries in this catalog and sub-catalogs

                     Parameters

                            sofar: dict or None
                                   Within recursion, use this dict for output

                            prefix: list of str or None
                                   Names of levels already visited

                            depth: int
                                   Number of levels to descend; needed to truncate circular references  and  for
                                   cleaner output

                     Returns

                            Dict where the keys are the entry names in dotted syntax, and the

                            values are entry instances.

   Base Classes
       This is a reference API class listing, useful mainly for developers.
               ┌──────────────────────────────────────────────┬───────────────────────────────────────┐
               │ intake.source.base.DataSourceBase(*args,     │ An object which can produce data      │
               │ ...)                                         │                                       │
               ├──────────────────────────────────────────────┼───────────────────────────────────────┤
                │ intake.source.base.DataSource(*args,         │ A   Data  Source  with  all  optional │
               │ **kwargs)                                    │ functionality                         │
               ├──────────────────────────────────────────────┼───────────────────────────────────────┤
               │ intake.source.base.PatternMixin()            │ Helper  class  to  provide  file-name │
               │                                              │ parsing abilities to a driver class   │
               ├──────────────────────────────────────────────┼───────────────────────────────────────┤
               │ intake.container.base.RemoteSource(*args,    │ Base class for all DataSources living │
               │ ...)                                         │ on an Intake server                   │
               ├──────────────────────────────────────────────┼───────────────────────────────────────┤
               │ intake.catalog.Catalog(*args, **kwargs)      │ Manages  a  hierarchy of data sources │
               │                                              │ as a collective unit.                 │
               ├──────────────────────────────────────────────┼───────────────────────────────────────┤
               │ intake.catalog.entry.CatalogEntry(*args,     │ A single item appearing in a catalog  │
               │ ...)                                         │                                       │
               ├──────────────────────────────────────────────┼───────────────────────────────────────┤
               │ intake.catalog.local.UserParameter(*args,    │ A user-settable item that  is  passed │
               │ ...)                                         │ to a DataSource upon instantiation.   │
               ├──────────────────────────────────────────────┼───────────────────────────────────────┤
               │ intake.auth.base.BaseAuth(*args,             │ Base class for authorization          │
               │ **kwargs)                                    │                                       │
               ├──────────────────────────────────────────────┼───────────────────────────────────────┤
               │ intake.source.cache.BaseCache(driver,        │ Provides   utilities   for   managing │
               │ spec)                                        │ cached data files.                    │
               ├──────────────────────────────────────────────┼───────────────────────────────────────┤
               │ intake.source.base.Schema(**kwargs)          │ Holds details of data description for │
               │                                              │ any type of data-source               │
               ├──────────────────────────────────────────────┼───────────────────────────────────────┤
               │ intake.container.persist.PersistStore(*args, │ Specialised  catalog  for   persisted │
               │ ...)                                         │ data-sources                          │
               └──────────────────────────────────────────────┴───────────────────────────────────────┘

       class intake.source.base.DataSource(*args, **kwargs)
               A Data Source with all optional functionality

              When  subclassed,  child  classes  will  have  the  base  data source functionality, plus caching,
              plotting and persistence abilities.

              plot   Accessor for HVPlot methods.  See plotting for more details.

       class intake.catalog.Catalog(*args, **kwargs)
              Manages a hierarchy of data sources as a collective unit.

              A catalog is a set of available data sources for an individual entity (remote server, local  file,
              or a local directory of files). This can be expanded to include a collection of subcatalogs, which
              are then managed as a single unit.

              A catalog is created with a single URI or a collection of URIs. A URI can either be  a  URL  or  a
              file path.

              Each  catalog  in the hierarchy is responsible for caching the most recent refresh time to prevent
              overeager queries.

              Attributes

                     metadata
                            dict Arbitrary information to carry along with the data source specs.

              configure_new(**kwargs)
                     Create a new instance of this source with altered arguments

                     Enables the picking  of  options  and  re-evaluating  templates  from  any  user-parameters
                     associated with this source, or overriding any of the init arguments.

                     Returns  a new data source instance. The instance will be recreated from the original entry
                     definition in a catalog if this source was originally created from a catalog.

              discover()
                     Open resource and populate the source attributes.

              filter(func)
                     Create a Catalog of a subset of entries based on a condition

                     WARNING:
                        This function operates on CatalogEntry objects not DataSource objects.

                      NOTE:
                         Whatever specific class this is performed on, the returned instance is a
                         Catalog. The entries are passed unmodified, so they will still reference
                         the original catalog instance and include its details, such as the catalog
                         directory.

                     Parameters

                            func   function This should take a CatalogEntry and  return  True  or  False.  Those
                                   items returning True will be included in the new Catalog, with the same entry
                                   names

                     Returns

                            Catalog
                                   New catalog with Entries that still refer to their parents
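
                      For illustration, the following keeps only entries whose description mentions
                      "daily" (cat is a previously opened catalog):

                         def is_daily(entry):
                             # entry is a CatalogEntry; describe() gives name, description, ...
                             return "daily" in (entry.describe().get("description") or "")

                         sub = cat.filter(is_daily)
                         list(sub)   # entry names that passed the condition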

              force_reload()
                      Reload the catalog now, regardless of how recently it was last loaded

              classmethod from_dict(entries, **kwargs)
                     Create Catalog from the given set of entries

                     Parameters

                            entries
                                   dict-like  A  mapping  of  name:entry which supports dict-like functionality,
                                   e.g., is derived from collections.abc.Mapping.

                             kwargs Passed on to the constructor; things like metadata and name, see __init__.

                     Returns

                            Catalog instance
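
                      A minimal sketch, assuming cat is an existing catalog and "first"/"second" are
                      hypothetical names of its entries (the _entries mapping is the same mutable
                      attribute mentioned under pop() below):

                         from intake.catalog import Catalog

                         subset = Catalog.from_dict(
                             {name: cat._entries[name] for name in ("first", "second")},
                             name="subset",
                             metadata={"purpose": "demo"},
                         )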

              get(**kwargs)
                     Create a new instance of this source with altered arguments

                     Enables the picking  of  options  and  re-evaluating  templates  from  any  user-parameters
                     associated with this source, or overriding any of the init arguments.

                     Returns  a new data source instance. The instance will be recreated from the original entry
                     definition in a catalog if this source was originally created from a catalog.

              property gui
                     Source GUI, with parameter selection and plotting

              items()
                     Get an iterator over (key, source) tuples for the catalog entries.

              keys() Entry names in this catalog as an iterator (alias for __iter__)

              pop(key)
                     Remove entry from catalog and return it

                     This relies on the _entries attribute being mutable, which it normally is. Note that  if  a
                     catalog automatically reloads, any entry removed here may soon reappear

                     Parameters

                             key    str Key of the entry to remove from the catalog

              reload()
                     Reload catalog if sufficient time has passed

              save(url, storage_options=None)
                     Output this catalog to a file as YAML

                     Parameters

                            url    str Location to save to, perhaps remote

                            storage_options
                                   dict Extra arguments for the file-system

              serialize()
                     Produce YAML version of this catalog.

                     Note  that  this  is not the same as .yaml(), which produces a YAML block referring to this
                     catalog.
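
                      For example (the target path is arbitrary):

                         cat.save("copy_of_catalog.yaml")   # write the catalog out as YAML
                         text = cat.serialize()             # the YAML text of the catalog, as a string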

              values()
                     Get an iterator over the sources for catalog entries.

              walk(sofar=None, prefix=None, depth=2)
                     Get all entries in this catalog and sub-catalogs

                     Parameters

                            sofar: dict or None
                                   Within recursion, use this dict for output

                            prefix: list of str or None
                                   Names of levels already visited

                            depth: int
                                   Number of levels to descend; needed to truncate circular references  and  for
                                   cleaner output

                     Returns

                             Dict where the keys are the entry names in dotted syntax, and the values
                             are entry instances.
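
                      For instance, with nested catalogs:

                         for dotted_name, entry in cat.walk(depth=2).items():
                             print(dotted_name)   # e.g. "subcat.entry1" for entries of a nested catalog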

       class intake.catalog.entry.CatalogEntry(*args, **kwargs)
              A single item appearing in a catalog

              This  is the base class, used by local entries (i.e., read from a YAML file) and by remote entries
              (read from a server).

              describe()
                     Get a dictionary of attributes of this entry.

                     Returns: dict with keys

                            name: str
                                   The name of the catalog entry.

                            container
                                   str kind of container used by this data source

                            description
                                   str Markdown-friendly description of data source

                            direct_access
                                   str Mode of remote access: forbid, allow, force

                            user_parameters
                                   list[dict] List of user parameters defined by this entry

              get(**user_parameters)
                     Open the data source.

                     Equivalent to calling the catalog entry like a function.

                     Note: entry(), entry.attr, entry[item] check for persisted sources,  but  directly  calling
                     .get() will always ignore the persisted store (equivalent to self._pmode=='never').

                     Parameters

                            user_parameters
                                   dict Values for user-configurable parameters for this data source

                     Returns

                            DataSource

              property has_been_persisted
                     For the source created with the given args, has it been persisted?

              property plots
                     List custom associated quick-plots
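
               A hedged sketch of typical entry usage, reaching the entry through the catalog's _entries
               mapping mentioned under Catalog.pop() above (the entry name and the user parameter "year"
               are hypothetical):

                  entry = cat._entries["mysource"]   # the underlying CatalogEntry
                  info = entry.describe()            # name, container, description, direct_access, ...
                  src = entry()                      # open the source, honouring any persisted version
                  src = entry.get(year=2020)         # same, but always ignores the persist store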

       class intake.container.base.RemoteSource(*args, **kwargs)
              Base class for all DataSources living on an Intake server

              to_dask()
                     Return a dask container for this data source

       class intake.catalog.local.UserParameter(*args, **kwargs)
              A user-settable item that is passed to a DataSource upon instantiation.

               For string parameters, the default may include special functions of the form func(args), which are
               expanded from environment variables or by executing a shell command.

              Parameters

                     name: str
                            the key that appears in the DataSource argument strings

                     description: str
                            narrative text

                     type: str
                             one of list(COERCION_RULES)

                     default: type value
                             same type as type. If a str, it may include the special functions env, shell,
                             client_env, client_shell.

                     min, max: type value
                            for validation of user input

                     allowed: list of type
                            for validation of user input

              describe()
                     Information about this parameter

              expand_defaults(client=False, getenv=True, getshell=True)
                     Compile env, client_env, shell and client_shell commands

              validate(value)
                     Does value meet parameter requirements?
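
               A small, hedged example of defining and validating a parameter directly in Python:

                  from intake.catalog.local import UserParameter

                  year = UserParameter(
                      name="year",
                      description="Year of data to load",
                      type="int",
                      default=2020,
                      min=2000,
                      max=2030,
                  )
                  year.describe()       # summary dict of this definition
                  year.validate(2015)   # returns the (coerced) value; out-of-range input raises an error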

       class intake.auth.base.BaseAuth(*args, **kwargs)
              Base class for authorization

              Subclass this and override the methods to implement a new type of auth.

              This basic class allows all access.

              allow_access(header, source, catalog)
                      Is the given HTTP header allowed to access the given data source?

                     Parameters

                            header: dict
                                   The HTTP header from the incoming request

                            source: CatalogEntry
                                   The data source the user wants to access.

                            catalog: Catalog
                                   The catalog object containing this data source.

              allow_connect(header)
                      Is the given request header allowed to talk to the server?

                     Parameters

                            header: dict
                                   The HTTP header from the incoming request

              get_case_insensitive(dictionary, key, default=None)
                     Case-insensitive search of a dictionary for key.

                     Returns the value if key match is found, otherwise default.
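
               A hedged sketch of a custom auth class that requires a shared secret in the request header
               (the header name "intake-secret" and the token handling are purely illustrative):

                  from intake.auth.base import BaseAuth

                  class SharedSecretAuth(BaseAuth):
                      """Allow only requests carrying the expected token."""

                      def __init__(self, token, *args, **kwargs):
                          super().__init__(*args, **kwargs)
                          self.token = token

                      def allow_connect(self, header):
                          # case-insensitive lookup of the header supplied by the client
                          return self.get_case_insensitive(header, "intake-secret") == self.token

                      def allow_access(self, header, source, catalog):
                          # in this sketch, every authenticated client may read every source
                          return self.allow_connect(header)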

       class intake.source.cache.BaseCache(driver, spec, catdir=None, cache_dir=None, storage_options={})
              Provides utilities for managing cached data files.

               Providers of caching functionality should derive from this, and appear as entries in the
               registry. The principal methods to override are _make_files(), _load() and _from_metadata().

              clear_all()
                     Clears all cache and metadata.

              clear_cache(urlpath)
                     Clears cache and metadata for a given urlpath.

                     Parameters

                            urlpath: str, location of data
                                   May be a local path, or remote path if including a protocol specifier such as
                                   's3://'. May include glob wildcards.

              get_metadata(urlpath)

                     Parameters

                            urlpath: str, location of data
                                   May be a local path, or remote path if including a protocol specifier such as
                                   's3://'. May include glob wildcards.

                     Returns

                            Metadata (dict) about a given urlpath.

              load(urlpath, output=None, **kwargs)
                     Downloads  data from a given url, generates a hashed filename, logs metadata, and caches it
                     locally.

                     Parameters

                            urlpath: str, location of data
                                   May be a local path, or remote path if including a protocol specifier such as
                                   's3://'. May include glob wildcards.

                            output: bool
                                   Whether to show progress bars; turn off for testing

                     Returns

                             List of local cache_paths to be opened instead of the remote file(s). If
                             caching is disabled, the urlpath is returned.

       class intake.source.base.PatternMixin
              Helper class to provide file-name parsing abilities to a driver class

       class intake.source.base.Schema(**kwargs)
              Holds details of data description for any type of data-source

              This should always be pickleable, so that it can be sent from a server to a  client,  and  contain
              all information needed to recreate a RemoteSource on the client.

       class intake.container.persist.PersistStore(*args, **kwargs)
              Specialised catalog for persisted data-sources

              add(key, source)
                     Add the persisted source to the store under the given key

                     key    str The unique token of the un-persisted, original source

                     source DataSource  instance  The  thing  to  add  to  the persisted catalogue, referring to
                            persisted data

              backtrack(source)
                     Given a unique key in the store, recreate original source

              get_tok(source)
                     Get string token from object

                      Strings are assumed to already be tokens; for a source or entry, check whether it is
                      a persisted thing ("original_tok" is in its metadata), otherwise generate its own
                      token.

              needs_refresh(source)
                     Has the (persisted) source expired in the store

                      Will return True if the source is not in the store at all, if its TTL is set to
                      None, or if more seconds have passed than the TTL.

              refresh(key)
                     Recreate and re-persist the source for the given unique ID

              remove(source, delfiles=True)
                     Remove a dataset from the persist store

                     source str or DataSource or Lo If a str, this is the unique  ID  of  the  original  source,
                            which  is  the  key  of  the persisted dataset within the store. If a source, can be
                            either the original or the persisted source.

                     delfiles
                            bool Whether to remove the on-disc artifact
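
               A hedged sketch of interacting with the store directly (source is assumed to be an
               already-created DataSource; most users will instead rely on source.persist()):

                  from intake.container.persist import PersistStore

                  store = PersistStore()          # catalogue of locally persisted sources
                  list(store)                     # tokens of everything persisted so far
                  tok = store.get_tok(source)     # token identifying the original source
                  store.needs_refresh(source)     # True if absent, expired, or with TTL of None
                  store.remove(source)            # drop the persisted copy and its files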

   Other Classes
   Cache Types
               ───────────────────────────────────────────────────────────────────────────────────────
                 intake.source.cache.FileCache(driver,         Cache specific set of files
                 spec)
               ───────────────────────────────────────────────────────────────────────────────────────
                 intake.source.cache.DirCache(driver,          Cache a complete directory tree
                 spec[, ...])
               ───────────────────────────────────────────────────────────────────────────────────────
                 intake.source.cache.CompressedCache(driver,   Cache files extracted from downloaded
                 spec)                                         compressed source
               ───────────────────────────────────────────────────────────────────────────────────────
                 intake.source.cache.DATCache(driver, spec[,   Use the  DAT  protocol  to  replicate
                 ...])                                         data
               ───────────────────────────────────────────────────────────────────────────────────────
                 intake.source.cache.CacheMetadata(*args,      Utility class for managing persistent
                 ...)                                          metadata  stored in the Intake config
                                                               directory.

ROADMAP

       Some  high-level  work  that  we  expect  to  be  achieved  on the time-scale of months. This list is not
       exhaustive, but rather aims to whet the appetite for what Intake can be in the future.

       Since Intake aims to be a community of data-oriented pythoneers, nothing written here is laid  in  stone,
       and users and devs are encouraged to make their opinions known!

   Broaden the coverage of formats
       Data-type  drivers  are easy to write, but still require some effort, and therefore reasonable impetus to
       get the work done. Conversations over the coming months can help determine the  drivers  that  should  be
       created by the Intake team, and those that might be contributed by the community.

       The  next type that we would specifically like to consider is machine learning model artifacts.  EDIT see
       https://github.com/AlbertDeFusco/intake-sklearn , and hopefully more to come.

   Streaming Source
       Many data sources are inherently time-sensitive and event-wise. These are not covered  well  by  existing
       Python  tools,  but  the  streamz  library may present a nice way to model them. From the Intake point of
       view, the task would be to develop a streaming type, and at least one data driver that uses it.

        The most obvious place to start would be reading a file: every time a new line appears in the file, an event
       is emitted. This is appropriate, for instance, for watching the log files of  a  web-server,  and  indeed
       could be extended to read from an arbitrary socket.

       EDIT see: https://github.com/intake/intake-streamz

   Server publish hooks
       To  add  API  endpoints  to  the  server,  so  that  a  user  (with  sufficient  privilege) can post data
       specifications to a running server, optionally saving the specs to a catalog server-side. Furthermore, we
       will consider the possibility of being able to upload and/or transform data (rather than refer to it in a
       third-party location), so that you would have a one-line "publish" ability from the client.

        The server, in general, could do with a lot of work to become more than the current
        demonstration/prototype. In particular, it should be performant and scalable, meaning that the
        server implementation ought to keep as little local state as possible.

   Simplify dependencies and class hierarchy
        We would like to make it easier to write Intake drivers which don't need any persist or GUI
       functionality, and to be able to install Intake core functionality (driver  registry,  data  loading  and
       catalog traversal) without needing many other packages at all.

        EDIT: this has been partly done; you can derive from DataSourceBase and not have to use the full
        set of Intake's features. We have also gone some distance towards separating out dependencies for
        parts of the package, so that you can install Intake and only use some of the subpackages/modules -
        imports don't happen until those parts of the code are used. We have not yet split the intake conda
        package into, for example, intake-base, intake-server, intake-gui...

   Reader API
       For those that wish to provide Intake's data source API,  and  make  data  sources  available  to  Intake
       cataloguing,  but  don't  wish  to  take Intake as a direct dependency.  The actual API of DataSources is
       rather simple:

       • __init__: collect arguments, minimal IO at this point

       • discover(): get metadata from the source, by querying the files/service itself

       • read(): return in-memory version of the data

       • to_*: return reference objects for the given compute engine, typically Dask

       • read_partition(...): read part of the data into memory, where the argument makes sense  for  the  given
         type of data

       • configure_new(): create new instance with different arguments

       • yaml(): representation appropriate for inclusion in a YAML catalogue

       • close(): release any resources

        Of these, only the first three are really necessary for a minimal interface, so Intake might do
        well to publish this protocol specification, so that new drivers can be written that can be used
        by Intake but do not need Intake, and so help adoption.
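
        A minimal, Intake-free sketch of such a reader (all names are illustrative, and pandas is used
        only as an example container library):

           import pandas as pd

           class MinimalCSVReader:
               """Follows the minimal reader protocol above without importing Intake."""
               container = "dataframe"

               def __init__(self, urlpath):
                   # collect arguments only; no IO happens here
                   self.urlpath = urlpath
                   self._df = None

               def discover(self):
                   # cheap metadata query: peek at a few rows to learn the columns
                   head = pd.read_csv(self.urlpath, nrows=10)
                   return {"dtype": head.dtypes.to_dict(), "shape": (None, len(head.columns))}

               def read(self):
                   # in-memory version of the whole data-set
                   if self._df is None:
                       self._df = pd.read_csv(self.urlpath)
                   return self._df

               def close(self):
                   self._df = None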

GLOSSARY

       Argument
              One of a set of values passed to a function or class. In the Intake sense, this usually is the set
              of key-value pairs defined in  the  "args"  section  of  a  source  definition;  unless  the  user
              overrides, these will be used for instantiating the source.

       Cache  Local  copies  of  remote files. Intake allows for download-on-first-use for data-sources, so that
              subsequent access is much faster, see caching. The format of the files is unchanged in this  case,
              but may be decompressed.

       Catalog
              A  collection  of  entries, each of which corresponds to a specific Data-set. Within these docs, a
              catalog  is  most  commonly  defined  in  a  YAML  file,  for  simplicity,  but  there  are  other
              possibilities,  such as connecting to an Intake server or another third-party data service, like a
              SQL database. Thus, catalogs form a hierarchy: any catalog can contain other, nested catalogs.

       Catalog file
              A YAML specification file which contains a list of named  entries  describing  how  to  load  data
               sources. See Catalog.

        Conda  A package and environment management system for the python ecosystem, see the conda
               website. Conda ensures dependencies and correct versions are installed for you, provides
               precompiled, binary-compatible software, and extends to many languages beyond python, such
               as R, javascript and C.

       Conda package
              A  single  installable  item  which  the  Conda  application  can install. A package may include a
              Catalog, data-files and maybe some additional code. It will also include a  specification  of  the
              dependencies  that it requires (e.g., Intake and any additional Driver), so that Conda can install
              those automatically. Packages can be created locally, or can be found  on  anaconda.org  or  other
              package repositories.

       Container
              One  of  the  supported data formats. Each Driver outputs its data in one of these. The containers
              correspond to familiar data structures for end-analysis, such as list-of-dicts, Numpy nd-array  or
              Pandas data-frame.

       Data-set
              A specific collection of data. The type of data (tabular, multi-dimensional or something else) and
              the  format (file type, data service type) are all attributes of the data-set. In addition, in the
              context of Intake, data-sets are usually entries within a Catalog with additional descriptive text
              and metadata and a specification of how to load the data.

       Data Source
              An Intake specification for a specific Data-set. In most cases, the two terms are synonymous.

       Data User
              A person who uses data to produce models and other inferences/conclusions. This  person  generally
              uses  standard  python  analysis  packages  like  Numpy, Pandas, SKLearn and may produce graphical
              output. They will want to be able to find the right data for a given job, and for the data  to  be
              available  in  a  standard  format  as  quickly and easily as possible. In many organisations, the
              appropriate job title may be Data Scientist, but research scientists and BI/analysts also fit this
              description.

       Data packages
              Data packages are standard conda packages that install an Intake  catalog  file  into  the  user’s
              conda  environment  ($CONDA_PREFIX/share/intake).  A data package does not necessarily imply there
              are data files inside the package. A data package could describe  remote  data  sources  (such  as
              files in S3) and take up very little space on disk.

       Data Provider
              A  person  whose  main  objective  is  to  curate data sources, get them into appropriate formats,
              describe the contents, and disseminate the data to those that need to use them. Such a person  may
              care  about  the  specifics of the storage format and backing store, the right number of fields to
               keep and removing bad data. They may have a good idea of the best way to visualise any given
              data-set.  In  an  organisation, this job may be known as Data Engineer, but it could as easily be
              done by a member of the IT team. These people are the most likely to author Catalogs.

       Developer
              A person who writes or fixes code. In the context of Intake,  a  developer  may  make  new  format
              Drivers,  create  authentication  systems  or  add  functionality  to Intake itself. They can take
              existing code for loading data in other projects, and use Intake to add extra functionality to it,
              for instance, remote data access, parallel processing, or file-name parsing.

       Driver The thing that does the work of reading the data for a catalog entry is known as a  driver,  often
              referred  to  using a simple name such as "csv". Intake has a plugin architecture, and new drivers
              can be created or installed, and specific catalogs/data-sets may require  particular  drivers  for
              their  contained  data-sets.  If  installed  as  Conda  packages,  then these requirements will be
              automatically installed for you. The driver's output will be a Container, and often the code is  a
              simpler layer over existing functionality in a third-party package.

       GUI    A Graphical User Interface. Intake comes with a GUI for finding and selecting data-sets, see gui.

       IT     The Information Technology team for an organisation. Such a team may have control of the computing
              infrastructure and security (sys-ops), and may well act as gate-keepers when exposing data for use
               by other colleagues. Commonly, IT has stronger policy enforcement requirements than other groups,
              for instance requiring all data-set copy actions to be logged centrally.

       Persist
              A process of making a local version of a data-source. One canonical format is used for each of the
              container types, optimised for quick and parallel access. This is particularly useful if the  data
              takes  a  long  time  to  acquire, perhaps because it is the result of a complex query on a remote
              service. The resultant output can be set to expire and be automatically refreshed, see persisting.
              Not to be confused with the cache.

       Plugin Modular extra functionality for Intake, provided by a package that is  installed  separately.  The
              most  common  type  of  plugin will be for a Driver to load some particular data format; but other
              parts of Intake are pluggable, such as authentication mechanisms for the server.

       Server A remote source for Intake catalogs. The server will provide data source specifications  (i.e.,  a
              remote  Catalog), and may also provide the raw data, in situations where the client is not able or
              not allowed to access it directly. As such, the server can act as a gatekeeper  of  the  data  for
              security  and monitoring purposes. The implementation of the server in Intake is accessible as the
              intake-server command, and acts as a reference: other implementations can easily  be  created  for
              specific circumstances.

       TTL    Time-to-live, how long before the given entity is considered to have expired. Usually in seconds.

       User Parameter
               A data source definition can contain a "parameters" section, which can act as explicit
               decision indicators for the user, or as validation and type coercion for the definition's
               Arguments. See paramdefs.

       YAML   A  text-based  format for expressing data with a dictionary (key-value) and list structure, with a
              limited number of data-types, such as strings and numbers. YAML uses indentations to nest objects,
              making it easy to read and write for humans, compared to JSON. Intake's catalogs  and  config  are
              usually expressed in YAML files.

COMMUNITY

       Intake  is  used  and developed by individuals at a variety of institutions.  It is open source (license)
       and sits within the broader Python numeric ecosystem commonly referred to as PyData or SciPy.

   Discussion
       Conversation happens in the following places:

       1. Usage questions are directed to Stack Overflow with the #intake tag.  Intake developers  monitor  this
          tag.

       2. Bug  reports  and  feature requests are managed on the GitHub issue tracker. Individual intake plugins
          are  managed  in  separate  repositories  each  with  its  own  issue  tracker.  Please  consult   the
          plugin-directory for a list of available plugins.

        3. Chat occurs at gitter.im/ContinuumIO/intake.  Note that because gitter chat is not searchable by
          future users we discourage usage questions and bug reports on gitter and instead  ask  people  to  use
          Stack Overflow or GitHub.

       4. Monthly  community  meeting  happens  the  first  Thursday  of  the month at 9:00 US Central Time. See
          https://github.com/intake/intake/issues/596, with a reminder sent out on the gitter channel.  Strictly
          informal chatter.

   Asking for help
       We  welcome  usage questions and bug reports from all users, even those who are new to using the project.
       There are a few things you can do to improve the likelihood of quickly getting a good answer.

        1. Ask questions in the right place:  We strongly prefer the use of Stack Overflow or GitHub issues
           over Gitter chat.  GitHub and Stack Overflow are more easily searchable by future users, and are
           therefore more efficient for everyone's time.  Gitter chat is strictly reserved for developer and
           community discussion.

           If you have a general question about how something should work or want best practices, then use
           Stack Overflow.  If you think you have found a bug, then use GitHub.

        2. Ask only in one place: Please restrict yourself to posting your question in only one place
           (likely Stack Overflow or GitHub) and don't post in both.

       3. Create  a  minimal  example:   It  is  ideal  to  create minimal, complete, verifiable examples.  This
          significantly reduces the time that answerers spend understanding your situation, resulting in  higher
          quality answers more quickly.


AUTHOR

       Anaconda

COPYRIGHT

       2022, Anaconda

0.6.5                                             Jan 17, 2022                                         INTAKE(1)