Ubuntu Manpage: bp_genbank_ref_extractor - Retrieves all related sequences for a list of searches on Entrez gene

Provided by: libbio-eutilities-perl_1.77-3_all

NAME

       bp_genbank_ref_extractor - Retrieves all related sequences for a list of searches on Entrez gene

VERSION

       version 1.77

SYNOPSIS

       bp_genbank_ref_extractor [options] [Entrez Gene Queries]

DESCRIPTION

       This script searches the Entrez Gene database and retrieves not only the gene sequence but also the
       related transcript and protein sequences.

       The gene UIDs of multiple searches are collected before any retrieval is attempted, so each gene is
       analyzed only once even if it appears in the results of more than one search.
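The deduplication described above can be pictured with a minimal shell sketch. It is not the script's
own code: the helper name unique_uids and the UID values are hypothetical and only show how results
from several searches collapse into one list before retrieval.

```shell
# Hypothetical helper: merge UID lists read from standard input (one UID per
# line) into a numerically sorted list with duplicates removed.
unique_uids() {
  sort -n -u
}

# Two made-up result lists; UID 102 is returned by both searches.
{
  printf '%s\n' 101 102 103   # results of the first search
  printf '%s\n' 102 104       # results of the second search
} | unique_uids
```

Each UID is printed once, so a gene returned by both searches would be retrieved a single time.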

       Note that by default no sequences are saved (see options and examples).

OPTIONS

Several options can be used to fine-tune the script's behaviour. It is possible to obtain extra base pairs
upstream and downstream of the gene, and to control the naming of files and the genome assembly to use.

See the NON-BUGS section for problems when using default values of options.

--assembly
When retrieving the sequence, a specific assembly can be defined. The expected value is a regex, which
is matched case-insensitively. If it matches more than one assembly, the first match is used. It
defaults to "(primary|reference) assembly".
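As a rough illustration of the "case-insensitive, first match wins" rule, the sketch below applies the
default pattern to a made-up list of assembly names. The helper pick_assembly and the list are
hypothetical, and grep -E only approximates the Perl regex engine the script uses.

```shell
# Hypothetical helper: print the first line on standard input that matches
# the regex in $1, ignoring case.
pick_assembly() {
  grep -i -E -- "$1" | head -n 1
}

# Made-up assembly names, in the order they might be listed for a gene.
printf '%s\n' 'Alternate HuRef' 'Primary Assembly' 'Reference assembly' |
  pick_assembly '(primary|reference) assembly'
```

Both "Primary Assembly" and "Reference assembly" match the pattern, but only the first one listed is
selected, which is why a narrower regex is safer when a specific assembly is wanted.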

--debug
If set, even more output will be printed that may help with debugging. Unlike the messages from
--verbose and --very-verbose, these will not appear in the log file unless this option is selected.
This option also sets --very-verbose.

--downstream, --down
Specifies the number of extra base pairs to be retrieved downstream of the gene. These extra base
pairs will only affect the gene sequence, not the transcript or proteins.

--email
A valid email used to connect to the NCBI servers. This may be used by NCBI to contact users in case
of problems and before blocking access in case of heavy usage.

--api-key
NCBI requires an API key for requests over 10/sec as of December 2018. You may generate one in the
"My NCBI" area.

--format
Specifies the format in which the sequences will be saved. Defaults to genbank format. Valid formats are
'genbank' or 'fasta'.

--genes
Specifies the name for the gene files. By default, they are not saved. If no value is given, it defaults
to the gene UID. Possible values are 'uid', 'name', 'symbol' (the official symbol or nomenclature).

--help
Display the documentation (this text).

--limit
When making a query, limit the results to the first N hits, where N is the value given. This guards
against especially unspecific queries, and a warning will be given if a query returns more results than
the limit. The default value is 200. Note that this limit applies to each search separately.

--non-coding, --nonon-coding
Some protein coding genes have transcripts that are non-coding. By default, these sequences are saved
as well. --nonon-coding can be used to ignore those transcripts.

--proteins
Specifies the name for the protein files. By default, they are not saved. If no value is given, it
defaults to the protein accession. Possible values are 'accession', 'description', 'gene' (the
corresponding gene ID) and 'transcript' (the corresponding transcript accession).

Note that if not using 'accession', it is possible for files to be overwritten. It is possible for the
same gene to encode more than one protein or different proteins to have the same description.

--pseudo, --nopseudo
By default, sequences of pseudo genes will be saved. --nopseudo can be used to ignore those genes.

--save
Specifies the path for the directory where the sequence and log files will be saved. If the directory
does not exist, it will be created, although the path to it must exist. Files in the directory may be
overwritten if necessary. If unspecified, a directory named 'extracted sequences' in the current
directory will be used.

--save-data
This option saves the data (gene UIDs, description, product accessions, etc.) to a file. As an
optional value, the file format can be specified. Defaults to CSV.

Currently only CSV is supported.

Saving the data structure as a CSV file requires the installation of the Text::CSV module.

--transcripts, --mrna
Specifies the name for the transcript files. By default, they are not saved. If no value is given, it
defaults to the transcript accession. Possible values are 'accession', 'description', 'gene' (the
corresponding gene ID) and 'protein' (the protein the transcript encodes).

Note that if not using 'accession', it is possible for files to be overwritten. It is possible for the
same gene to have more than one transcript or different transcripts to have the same description.
Also, non-coding transcripts will create problems if using 'protein'.

--upstream, --up
Specifies the number of extra base pairs to be extracted upstream of the gene. These extra base pairs
will only affect the gene sequence, not the transcript or proteins.

--verbose, --v
If set, program becomes verbose. For an extremely verbose program, use --very-verbose instead.

--very-verbose, --vv
If set, program becomes extremely verbose. Setting this option automatically sets --verbose as well.
For help in debugging, consider using --debug.

EXAMPLES

        bp_genbank_ref_extractor \
          --transcripts=accession \
          '"homo sapiens"[organism] AND H2B'

       Search Entrez Gene with the query '"homo sapiens"[organism] AND H2B' and save only the transcript
       sequences of the hits. Note that with the default value of --limit, only some of the hits may be
       extracted.

        bp_genbank_ref_extractor \
          --transcripts=accession --proteins=accession \
           --format=fasta \
           '"homo sapiens"[organism] AND H2B' \
           '"homo sapiens"[organism] AND MCPH1'

       Save both transcript and protein sequences in the fasta format, for two queries:
       '"homo sapiens"[organism] AND H2B' and '"homo sapiens"[organism] AND MCPH1'.

        bp_genbank_ref_extractor \
          --genes --down=500 --up=100 \
          '"homo sapiens"[organism] AND H2B'

       Download genomic sequences, including 500 bp downstream and 100 bp upstream of each gene.
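To make the effect of --up and --down concrete, here is a small worked sketch of the coordinate
arithmetic. The helper extend_range and the coordinates are hypothetical. The strand handling follows
the usual convention that upstream means 5' of the gene on its own strand; that the script computes
its ranges exactly this way is an assumption.

```shell
# Hypothetical helper: extend a gene range by UP and DOWN base pairs.
# usage: extend_range START END STRAND UP DOWN   (STRAND is + or -)
extend_range() {
  start=$1 end=$2 strand=$3 up=$4 down=$5
  if [ "$strand" = "-" ]; then
    # Minus strand: upstream lies beyond END, downstream before START.
    new_start=$((start - down))
    new_end=$((end + up))
  else
    # Plus strand: upstream lies before START, downstream beyond END.
    new_start=$((start - up))
    new_end=$((end + down))
  fi
  printf '%s..%s\n' "$new_start" "$new_end"
}

extend_range 1000 2000 + 100 500   # plus strand:  900..2500
extend_range 1000 2000 - 100 500   # minus strand: 500..2100
```

In both cases the retrieved region grows by 600 bp in total; only the side on which each extension
lands differs.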

        bp_genbank_ref_extractor \
          --genes --assembly='Alternate HuRef' \
          '"homo sapiens"[organism] AND H2B'

       Download genomic sequences from the Alternate HuRef genome assembly.

        bp_genbank_ref_extractor --save-data=CSV \
          '"homo sapiens"[organism] AND H2B'

       Do not save any sequence, only save the results in a CSV file.

        bp_genbank_ref_extractor --save='search-results' \
          --genes=name --downstream=500 --upstream=200 \
          --nopseudo --nonon-coding --transcripts --proteins \
          --format=fasta --save-data=CSV \
          '"homo sapiens"[organism] AND H2B' \
          '"homo sapiens"[organism] AND MCPH1'

       Ignoring non-coding and pseudo genes, downloads: genomic sequences with 500 and 200 bp downstream and
       upstream respectively, using the gene name as filename; transcript and protein sequences using their
       accession number as filename; everything in fasta format plus a CSV file with search results; saved in a
       directory named search-results.

NON-BUGS

       •   When supplying options, it's possible to not supply a value and use their default. However, when the
           expected value is a string, the next argument may be confused as value for the option. For example,
           when using the following command:

            bp_genbank_ref_extractor --transcripts \
              'H2A AND homo sapiens'

           we mean to search for 'H2A AND homo sapiens' saving only the transcripts and using the default as
           base for the filename. However, the search terms will be interpreted as the base for the filenames
           (but since it's not a valid identifier, it will return an error). To prevent this, you can either
           specify the values:

            bp_genbank_ref_extractor --transcripts='accession' \
              'H2A AND homo sapiens'

           or you can use the double dash to stop processing options. Note that this should only be used after
           the last option. All arguments supplied after the double dash will be interpreted as search terms:

            bp_genbank_ref_extractor --transcripts \
              -- 'H2A AND homo sapiens'

NOTES ON USAGE

       •   Genes that are marked as 'live' and 'protein-coding' should have at least one transcript. However,
           this is not always true due to mistakes in annotation. Such cases will throw a warning. When faced
           with this, be nice and write to the Entrez RefSeq maintainers
           <http://www.ncbi.nlm.nih.gov/RefSeq/update.cgi>.

       •   When creating the directories to save the files, if the directory already exists it will be used and
           no error or warning will be issued unless --debug has been set. If a non-directory file already exists
           with that name, bp_genbank_ref_extractor exits with an error.

       •   On the subject of verbosity, all messages are saved in the log file. The options --verbose and
           --very-verbose only affect their printing to standard output. Debug messages are different as they
           will only show up (and be logged) if requested with --debug.

       •   When saving a file, to avoid problems with limited filesystems such as NTFS or FAT, only some
           characters are allowed. All other characters will be replaced by an underscore. Allowed characters
           are:

           a-z 0-9 - +  . , () {} []'

       •   bp_genbank_ref_extractor tries to use the same file extensions that bioperl would expect when saving
           the file. If unable, it will use the '.seq' extension.
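The filename clean-up described in the notes above can be approximated with a short sed substitution.
This is a sketch, not the script's own code: the helper name sanitize_name is hypothetical, and it
assumes that letters are accepted in either case and that a space counts as an allowed character,
neither of which the list above states outright.

```shell
# Hypothetical helper: replace every character outside the allowed set
# (letters, digits, - + . , ( ) { } [ ] ' and, by assumption, space)
# with an underscore.
sanitize_name() {
  printf '%s\n' "$1" | LC_ALL=C sed "s/[^]A-Za-z0-9+ .,(){}['-]/_/g"
}

# The colon and the slash are not in the allowed set.
sanitize_name 'histone H2B: type 1/C'
```

Because distinct names can collapse to the same cleaned name, this is one more way for files to
overwrite each other when not naming them by accession.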

FEEDBACK

   Mailing lists
       User feedback is an integral part of the evolution of this and other Bioperl modules. Send your comments
       and suggestions preferably to the Bioperl mailing list. Your participation is much appreciated.

         bioperl-l@bioperl.org               - General discussion
         https://bioperl.org/Support.html    - About the mailing lists

   Support
       Please direct usage questions or support issues to the mailing list: bioperl-l@bioperl.org rather than to
       the module maintainer directly. Many experienced and responsive experts will be able to look at the
       problem and quickly address it. Please include a thorough description of the problem with code and data
       examples if at all possible.

   Reporting bugs
       Report bugs to the Bioperl bug tracking system to help us keep track of the bugs and their resolution.
       Bug reports can be submitted via the web:

         https://github.com/bioperl/bio-eutilities/issues

AUTHOR

       Carnë Draug <carandraug+dev@gmail.com>

COPYRIGHT

       This software is copyright (c) 2011-2015 by Carnë Draug.

       This software is available under the GNU General Public License, Version 3, June 2007.

perl v5.40.0                                       2025-01-27                       BP_GENBANK_REF_EXTRACTOR(1p)