Commit 0375cd1

committed
Update README for 0.5
1 parent 875b814 commit 0375cd1

1 file changed

Lines changed: 275 additions & 69 deletions

README.rst

@@ -1,19 +1,19 @@
11
pdbsearch
22
=========
33

4-
|travis| |coveralls| |pypi| |version| |commit|
4+
|ci| |version| |pypi| |license| |commit|
55

6-
.. |travis| image:: https://api.travis-ci.org/samirelanduk/pdbsearch.svg?branch=master
7-
:target: https://travis-ci.org/samirelanduk/pdbsearch/
6+
.. |ci| image:: https://github.com/samirelanduk/pdbsearch/actions/workflows/main.yml/badge.svg
7+
:target: https://github.com/samirelanduk/pdbsearch/actions/workflows/main.yml
88

9-
.. |coveralls| image:: https://coveralls.io/repos/github/samirelanduk/pdbsearch/badge.svg?branch=master
10-
:target: https://coveralls.io/github/samirelanduk/pdbsearch/
9+
.. |version| image:: https://img.shields.io/pypi/v/pdbsearch.svg
10+
:target: https://pypi.org/project/pdbsearch/
1111

1212
.. |pypi| image:: https://img.shields.io/pypi/pyversions/pdbsearch.svg
1313
:target: https://pypi.org/project/pdbsearch/
1414

15-
.. |version| image:: https://img.shields.io/pypi/v/pdbsearch.svg
16-
:target: https://pypi.org/project/pdbsearch/
15+
.. |license| image:: https://img.shields.io/pypi/l/pdbsearch.svg?color=blue
16+
:target: https://github.com/samirelanduk/pdbsearch/blob/master/LICENSE
1717

1818
.. |commit| image:: https://img.shields.io/github/last-commit/samirelanduk/pdbsearch/master.svg
1919
:target: https://github.com/samirelanduk/pdbsearch/tree/master/
@@ -25,9 +25,13 @@ Example
2525
-------
2626

2727
>>> import pdbsearch
28-
>>> codes = pdbsearch.search(limit=5, ligand_name="CU")
29-
>>> codes
30-
['3HW7', '2WKO', '2WOF', '2WOH', '2WO0']
28+
>>> results = pdbsearch.search(rows=5, chem_comp__name__contains="zinc")
29+
>>> print(results["total_count"])
30+
26
31+
>>> print(results["result_set"])
32+
[{'identifier': '1A0B', 'score': 1.0}, {'identifier': '1A1F', 'score': 1.0},
33+
{'identifier': '1A1G', 'score': 1.0}, {'identifier': '1A1H', 'score': 1.0},
34+
{'identifier': '1A1I', 'score': 1.0}]
3135

3236

3337
Installing
@@ -45,16 +49,6 @@ If you get permission errors, try using ``sudo``:
4549
``$ sudo pip install pdbsearch``
4650

4751

48-
Development
49-
~~~~~~~~~~~
50-
51-
The repository for pdbsearch, containing the most recent iteration, can be
52-
found `here <http://github.com/samirelanduk/pdbsearch/>`_. To clone the
53-
pdbsearch repository directly from there, use:
54-
55-
``$ git clone git://github.com/samirelanduk/pdbsearch.git``
56-
57-
5852
Requirements
5953
~~~~~~~~~~~~
6054

@@ -80,71 +74,283 @@ Overview
8074
pdbsearch is a Python library for searching for PDB structures using the
8175
RCSB web services.
8276

83-
Returning all PDB Codes
84-
~~~~~~~~~~~~~~~~~~~~~~~
77+
Basic Search
78+
~~~~~~~~~~~~
8579

86-
You can get all PDB codes without any particular search expression like so:
80+
The default search will return PDB entry IDs, with no filtering:
8781

8882
>>> import pdbsearch
89-
>>> codes = pdbsearch.search(limit=None)
90-
>>> len(codes)
91-
174994
92-
93-
This will take a few seconds, and requires downloading a rather large JSON
94-
object over the network. Generally it is better to paginate the results:
95-
96-
>>> first_ten_codes = pdbsearch.search(limit=10)
97-
>>> second_ten_codes = pdbsearch.search(start=10, limit=10)
98-
>>> third_ten_codes = pdbsearch.search(start=20, limit=10)
99-
100-
You can sort the results by any of the terms at
101-
`<https://search.rcsb.org/structure-search-attributes.html>`_:
102-
103-
>>> most_recent_codes = pdbsearch.search(sort="rcsb_accession_info.deposit_date")
104-
>>> earliest_codes = pdbsearch.search(sort="-rcsb_accession_info.deposit_date")
105-
106-
As these are somewhat cumbersome, some of them have a shorthand:
83+
>>> results = pdbsearch.search()
84+
>>> print(results)
85+
{'query_id': '9a0c3543-0e29-462c-8357-f286293d9896', 'result_type': 'entry',
86+
'total_count': 247417, 'result_set': [{'identifier': '100D', 'score': 1.0},
87+
{'identifier': '101D', 'score': 1.0}, {'identifier': '101M', 'score': 1.0},
88+
{'identifier': '102D', 'score': 1.0}, {'identifier': '102L', 'score': 1.0},
89+
{'identifier': '102M', 'score': 1.0}, {'identifier': '103D', 'score': 1.0},
90+
{'identifier': '103L', 'score': 1.0}, {'identifier': '103M', 'score': 1.0},
91+
{'identifier': '104D', 'score': 1.0}]}
92+
93+
The JSON returned is the direct output from the RCSB search API. By default, it
94+
returns 10 results at a time.
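
As a small illustration of working with that response, the sketch below pulls the identifiers out of a result dictionary. It assumes the response has the shape printed above (``total_count`` and ``result_set`` keys). The ``identifiers`` helper is a hypothetical name, not part of pdbsearch, and the sample dictionary is an abbreviated copy of the output above.

```python
def identifiers(results):
    """Return the IDs from a search response shaped like the output above."""
    # Hypothetical helper: tolerate an empty or missing response.
    if not results:
        return []
    return [hit["identifier"] for hit in results.get("result_set", [])]


# Abbreviated copy of the response printed above, used as stand-in data.
sample = {
    "query_id": "9a0c3543-0e29-462c-8357-f286293d9896",
    "result_type": "entry",
    "total_count": 247417,
    "result_set": [
        {"identifier": "100D", "score": 1.0},
        {"identifier": "101D", "score": 1.0},
        {"identifier": "101M", "score": 1.0},
    ],
}

print(identifiers(sample))
```

With the library installed, the same helper could be applied to the return value of ``pdbsearch.search()``.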
95+
96+
Services
97+
########
98+
99+
RCSB defines a number of services which search in different ways. For example,
100+
the full text search service will search the entire PDB database for the given
101+
term, and you can access this with the ``term`` keyword argument:
102+
103+
>>> results = pdbsearch.search(term="thymidine kinase")
104+
105+
You can search for entries with other services, using the correct keyword
106+
arguments:
107+
108+
>>> # Sequence service
109+
>>> results = pdbsearch.search(protein="MALWMRLLPLLALLALWGPDPAAA")
110+
>>> results = pdbsearch.search(dna="ATGC", identity=0.95, evalue=1e-10)
111+
>>> results = pdbsearch.search(rna="AUGC")
112+
>>>
113+
>>> # Sequence motif service
114+
>>> results = pdbsearch.search(protein="C-X-C-X(2)-[LIVMYFWC]", pattern_type="prosite")
115+
>>> results = pdbsearch.search(dna="GTXXCA", pattern_type="simple")
116+
>>> results = pdbsearch.search(rna="AUG{2}C", pattern_type="regex")
117+
>>>
118+
>>> # Structure service (requires the ID of a specific assembly to look for)
119+
>>> results = pdbsearch.search(structure="1A2B-2", operator="relaxed_shape_match")
120+
>>>
121+
>>> # Structure motif service (requires a residue pattern in some entry)
122+
>>> results = pdbsearch.search(entry="1A2B", residues=(("A", 1), ("B", 2)))
123+
>>>
124+
>>> # Chemical service
125+
>>> results = pdbsearch.search(smiles="CC(C)C", match_type="graph-relaxed-stereo")
126+
>>> results = pdbsearch.search(inchi="InChI=1S/C6H12/c1-2-4-6-5-3-1/h1-6H2")
127+
128+
The most useful service, however, is the text service. The documentation for the
129+
RCSB search API lists a number of attributes that you can search on, and which
130+
can be viewed `here <https://search.rcsb.org/structure-search-attributes.html>`_.
131+
For example, ``pdbx_entity_nonpoly.name`` or
132+
``rcsb_nonpolymer_entity.pdbx_number_of_molecules``. You pass these as keyword
133+
arguments to the ``pdbsearch.search`` function, with a double underscore (``__``)
134+
in place of each dot.
135+
136+
>>> results = pdbsearch.search(pdbx_entity_nonpoly__name="glucose")
137+
>>> results = pdbsearch.search(rcsb_nonpolymer_entity__pdbx_number_of_molecules=0)
138+
139+
As the `RCSB documentation <https://search.rcsb.org/#search-api>`_ indicates, you can use a variety of operators to
140+
modify how you search, and these are encoded as suffixes to the keyword
141+
arguments:
142+
143+
- ``__gt``: ``greater``
144+
- ``__lt``: ``less``
145+
- ``__gte``: ``greater_or_equal``
146+
- ``__lte``: ``less_or_equal``
147+
- ``__in``: ``in``
148+
- ``__exists``: ``exists``
149+
- ``__range``: ``range`` (use ``tuple`` for exclusive range, ``list`` for inclusive range)
150+
- ``__contains``: ``contains_phrase``
151+
- ``__contains_phrase``: ``contains_phrase``
152+
- ``__contains_words``: ``contains_words``
153+
- ``__not``: ``not``
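
The tuple/list convention for ``__range`` can be pictured with the sketch below. It assumes the RCSB ``range`` operator takes a value with ``from``, ``to``, ``include_lower`` and ``include_upper`` keys, as described in the RCSB search API documentation. ``range_value`` is a hypothetical helper for illustration and not necessarily how pdbsearch builds the value internally.

```python
def range_value(bounds):
    """Translate a (low, high) tuple or [low, high] list into a range value.

    Hypothetical helper: a tuple is treated as an exclusive range and a
    list as an inclusive one, following the convention listed above.
    """
    lower, upper = bounds
    inclusive = isinstance(bounds, list)
    return {
        "from": lower,
        "to": upper,
        "include_lower": inclusive,
        "include_upper": inclusive,
    }


print(range_value((1.5, 2.0)))  # exclusive bounds
print(range_value([1.5, 2.0]))  # inclusive bounds
```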
154+
155+
You can also use the ``__not`` suffix to negate the search.
156+
157+
>>> results = pdbsearch.search(pdbx_entity_nonpoly__name__not="glucose")
158+
>>> results = pdbsearch.search(rcsb_nonpolymer_entity__pdbx_number_of_molecules__not=0)
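
To make the naming convention concrete, here is a minimal sketch of how such a keyword could be split into an attribute, an operator and a negation flag. ``parse_keyword`` and ``OPERATOR_SUFFIXES`` are hypothetical names for illustration only. The library itself relies on the RCSB schema to recognise attribute names, so this suffix-only approach is an assumption and would misread an attribute whose last segment happens to match an operator name.

```python
# Suffix-to-operator table copied from the list above (``not`` is handled separately).
OPERATOR_SUFFIXES = {
    "gt": "greater",
    "lt": "less",
    "gte": "greater_or_equal",
    "lte": "less_or_equal",
    "in": "in",
    "exists": "exists",
    "range": "range",
    "contains": "contains_phrase",
    "contains_phrase": "contains_phrase",
    "contains_words": "contains_words",
}


def parse_keyword(keyword):
    """Split a keyword into (attribute, operator, negated).

    Hypothetical helper: operator is None when no suffix is given,
    meaning the choice is left to the library.
    """
    parts = keyword.split("__")
    operator = None
    if len(parts) > 1 and parts[-1] in OPERATOR_SUFFIXES:
        operator = OPERATOR_SUFFIXES[parts.pop()]
    negated = False
    if len(parts) > 1 and parts[-1] == "not":
        parts.pop()
        negated = True
    # The remaining segments are the attribute path, joined with dots.
    return ".".join(parts), operator, negated


print(parse_keyword("chem_comp__formula_weight__lt"))
print(parse_keyword("pdbx_entity_nonpoly__name__not"))
```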
159+
160+
There is a very similar search service called the ``text_chem`` service, which
161+
has a different set of `attributes <https://search.rcsb.org/chemical-search-attributes.html>`_
162+
(in practice a subset of the structure attributes) but which works the
163+
same way. It searches properties of chemical compounds, such as formula weight.
164+
The ``search`` function will default to using this service if you are searching
165+
for small molecules.
166+
167+
>>> results = pdbsearch.search(return_type="mol_definition", chem_comp__formula_weight__lt=1000)
168+
169+
Return Types
170+
############
171+
172+
The above examples all search for entries (i.e. PDB files) but the RCSB API
173+
also lets you search for other types of objects, such as polymer entities,
174+
non-polymer entities, and chemical compounds.
175+
176+
The type of object you search for is determined by the ``return_type`` parameter.
177+
The possible values are: ``entry``, ``assembly``, ``polymer_entity``,
178+
``non_polymer_entity``, ``polymer_instance``, ``mol_definition``. Where the term
179+
you are searching for does not correspond to the type of object you are
180+
searching for (e.g. doing a sequence search but asking for non-polymer
181+
entities), the API will use entries as a base (i.e. finding all entries with a
182+
polymer entity matching your sequence, and then returning all non-polymer
183+
entities in those entries).
184+
185+
Alternatively, there are specific functions for searching for each of these types
186+
of objects - ``pdbsearch.search_entries``, ``pdbsearch.search_assemblies``, etc.
187+
188+
Multiple Queries
189+
################
190+
191+
You can combine any of the above parameters to search multiple services at once.
192+
These will be combined with an ``and`` operator.
193+
194+
>>> results = pdbsearch.search(
195+
return_type="polymer_entity",
196+
term="thymidine kinase",
197+
chem_comp__formula_weight__lt=1000,
198+
pdbx_struct_assembly__details__not__contains="good",
199+
protein="MALWMRLLPLLALLALWGPDPAAA",
200+
dna="ATGC",
201+
rna="AUGC",
202+
identity=0.95,
203+
evalue=1e-10,
204+
structure="1A2B-2",
205+
operator="relaxed_shape_match",
206+
entry="1A2B",
207+
residues=(("A", 1), ("B", 2)),
208+
rmsd=0.5,
209+
exchanges={("A", 1): ["ASP"], ("B", 2): ["HIS"]},
210+
smiles="CC(C)C",
211+
inchi="InChI=1S/C6H12/c1-2-4-6-5-3-1/h1-6H2",
212+
match_type="graph-relaxed-stereo",
213+
)
214+
215+
Request Options
216+
################
217+
218+
You can control how the search is performed and returned with various request
219+
option parameters:
220+
221+
>>> # Return all results in one response
222+
>>> results = pdbsearch.search(term="thymidine kinase", return_all=True)
223+
>>>
224+
>>> # Return only the count of results
225+
>>> results = pdbsearch.search(term="thymidine kinase", counts_only=True)
226+
>>>
227+
>>> # Return results starting from the 10th result
228+
>>> results = pdbsearch.search(term="thymidine kinase", start=10)
229+
>>>
230+
>>> # Return 20 results at a time
231+
>>> results = pdbsearch.search(term="thymidine kinase", rows=20)
232+
>>>
233+
>>> # Sort results by the deposit date (descending)
234+
>>> results = pdbsearch.search(term="thymidine kinase", sort="-rcsb_accession_info.deposit_date")
235+
>>>
236+
>>> # Sort results by the polymer entity count (ascending)
237+
>>> results = pdbsearch.search(term="thymidine kinase", sort="rcsb_assembly_info.polymer_entity_count")
238+
>>>
239+
>>> # Sort results by multiple attributes
240+
>>> results = pdbsearch.search(term="thymidine kinase", sort=["-rcsb_accession_info.deposit_date", "rcsb_assembly_info.polymer_entity_count"])
241+
>>>
242+
>>> # Return results with computational content type only
243+
>>> results = pdbsearch.search(term="thymidine kinase", content_types=["computational"])
244+
>>>
245+
>>> # Use the API's facets functionality (see their documentation for more details)
246+
>>> results = pdbsearch.search(term="thymidine kinase", facets=[...])
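
The ``start`` and ``rows`` options can be combined to walk through a large result set one page at a time. The sketch below assumes each response carries ``total_count`` and ``result_set`` keys, as in the output shown earlier. ``iter_identifiers`` is a hypothetical helper and ``fake_search`` is a stand-in for ``pdbsearch.search`` so that the example runs without network access.

```python
def iter_identifiers(search, rows=10, **criteria):
    """Yield identifiers page by page from a search callable.

    Hypothetical helper: ``search`` is expected to accept ``start`` and
    ``rows`` and to return a dict with ``total_count`` and ``result_set``.
    """
    start = 0
    while True:
        results = search(start=start, rows=rows, **criteria)
        if not results:
            return
        hits = results.get("result_set", [])
        if not hits:
            return
        for hit in hits:
            yield hit["identifier"]
        start += len(hits)
        if start >= results.get("total_count", 0):
            return


def fake_search(start=0, rows=10, **criteria):
    """Stand-in for pdbsearch.search that serves five made-up hits."""
    ids = ["100D", "101D", "101M", "102D", "102L"]
    page = ids[start:start + rows]
    return {
        "total_count": len(ids),
        "result_set": [{"identifier": i, "score": 1.0} for i in page],
    }


print(list(iter_identifiers(fake_search, rows=2)))
```

With the library installed, passing ``pdbsearch.search`` in place of ``fake_search`` (plus any search keywords) should page through real results, assuming it accepts the same ``start`` and ``rows`` options shown above.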
247+
248+
Nodes and Queries
249+
~~~~~~~~~~~~~~~~~
250+
251+
The ``pdbsearch.search`` function is useful for simple queries, but it has some limitations.
252+
253+
1. If using multiple queries, they will always be combined with an ``and`` operator.
254+
2. You can only provide one value per argument - if you have multiple protein sequences, you can't search for them all at once.
255+
256+
It can sometimes be useful to access the underlying node system that
257+
``pdbsearch.search`` is built on for more complex queries. This solves
258+
both of the above limitations.
259+
260+
Nodes
261+
#####
262+
263+
Each of the search services has a function for creating a single search node.
264+
265+
>>> # Full text search node
266+
>>> node = pdbsearch.full_text_node(term="thymidine kinase")
267+
>>>
268+
>>> # Text search node
269+
>>> node = pdbsearch.text_node(pdbx_struct_assembly__details__not__contains="good")
270+
>>>
271+
>>> # Text chem search node
272+
>>> node = pdbsearch.text_chem_node(chem_comp__formula_weight__lt=1000)
273+
>>>
274+
>>> # Sequence search nodes
275+
>>> node = pdbsearch.sequence_node(protein="MALWMRLLPLLALLALWGPDPAAA", identity=0.95, evalue=1e-10)
276+
>>> node = pdbsearch.sequence_node(dna="ATGC", identity=0.95, evalue=1e-10)
277+
>>> node = pdbsearch.sequence_node(rna="AUGC", identity=0.95, evalue=1e-10)
278+
>>>
279+
>>> # Sequence motif search node
280+
>>> node = pdbsearch.seqmotif_node(protein="C-X-C-X(2)-[LIVMYFWC]", pattern_type="prosite")
281+
>>>
282+
>>> # Structure search node
283+
>>> node = pdbsearch.structure_node("1A2B-2", operator="relaxed_shape_match")
284+
>>>
285+
>>> # Structure motif search node
286+
>>> node = pdbsearch.strucmotif_node("1A2B", residues=(("A", 1), ("B", 2)), rmsd=0.5, exchanges={("A", 1): ["ASP"], ("B", 2): ["HIS"]})
287+
>>>
288+
>>> # Chemical search nodes
289+
>>> node = pdbsearch.chemical_node(smiles="CC(C)C", match_type="graph-relaxed-stereo")
290+
>>> node = pdbsearch.chemical_node(inchi="InChI=1S/C6H12/c1-2-4-6-5-3-1/h1-6H2", match_type="graph-relaxed-stereo")
291+
292+
You can execute any of these nodes individually using their
293+
``query`` method. This takes a ``return_type`` argument
294+
and any of the request option parameters.
295+
296+
>>> results = node.query("entry", return_all=True, sort="-rcsb_accession_info.deposit_date")
297+
298+
Combining Nodes
299+
###############
300+
301+
All node objects have an ``and_`` and ``or_`` method, which can be used to combine
302+
them with other nodes.
303+
304+
>>> node1 = pdbsearch.full_text_node(term="thymidine kinase")
305+
>>> node2 = pdbsearch.text_node(pdbx_struct_assembly__details__not__contains="good")
306+
>>> node3 = pdbsearch.sequence_node(protein="MALWMRLLPLLALLALWGPDPAAA", identity=0.95, evalue=1e-10)
307+
>>> node4 = pdbsearch.sequence_node(dna="ATGC", identity=0.95, evalue=1e-10)
308+
>>> node5 = pdbsearch.sequence_node(rna="AUGC", identity=0.95, evalue=1e-10)
309+
>>> node = node1.and_(node2).or_(node3.and_(node4.or_(node5)))
310+
>>> results = node.query("entry", return_all=True, sort="-rcsb_accession_info.deposit_date")
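
One way to picture what combining nodes does is as a tree of RCSB query objects, where single nodes are ``terminal`` objects and combinations are ``group`` objects with a ``logical_operator``. The sketch below is a simplified, hypothetical ``Node`` class written for illustration. It is not the pdbsearch implementation, and the parameter dictionaries are assumptions based on the RCSB search API's JSON format.

```python
class Node:
    """Hypothetical query node: either a terminal or a group of nodes."""

    def __init__(self, service=None, parameters=None, operator=None, children=()):
        self.service = service
        self.parameters = parameters
        self.operator = operator
        self.children = tuple(children)

    def and_(self, other):
        # Combine this node and another under a logical "and" group.
        return Node(operator="and", children=(self, other))

    def or_(self, other):
        # Combine this node and another under a logical "or" group.
        return Node(operator="or", children=(self, other))

    def to_dict(self):
        # Terminal nodes carry a service and its parameters.
        if self.operator is None:
            return {
                "type": "terminal",
                "service": self.service,
                "parameters": self.parameters,
            }
        # Group nodes nest their children recursively.
        return {
            "type": "group",
            "logical_operator": self.operator,
            "nodes": [child.to_dict() for child in self.children],
        }


a = Node("full_text", {"value": "thymidine kinase"})
b = Node("full_text", {"value": "zinc"})
c = Node("full_text", {"value": "insulin"})
tree = a.and_(b).or_(c).to_dict()
print(tree["logical_operator"], len(tree["nodes"]))
```

Because each call returns a new node, chained calls nest from left to right: ``a.and_(b).or_(c)`` produces an ``or`` group whose first child is the ``and`` group of ``a`` and ``b``.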
311+
312+
Schemas
313+
~~~~~~~
107314

108-
>>> pdbsearch.search(limit=5, sort="code")
109-
['9XIM', '9XIA', '9WGA', '9RUB', '9RSA']
110-
>>> pdbsearch.search(limit=5, sort="-resolution")
111-
['3NIR', '5D8V', '1EJG', '3P4J', '5NW3']
315+
The text and text_chem services have a schema that defines the attributes you can
316+
search on. These can be read `here <https://search.rcsb.org/structure-search-attributes.html>`_
317+
and `here <https://search.rcsb.org/chemical-search-attributes.html>`_ respectively.
112318

113-
You can sort by multiple criteria:
319+
They are also available as JSON schema objects,
320+
`here <https://search.rcsb.org/rcsbsearch/v2/metadata/schema>`_ and
321+
`here <https://search.rcsb.org/rcsbsearch/v2/metadata/chemical/schema>`_
322+
respectively. This matters because pdbsearch sometimes needs type information
323+
about an attribute to choose the right operator, and it needs to know which
324+
parameter names belong to which service when parsing a
325+
``pdbsearch.search`` function call.
114326

115-
>>> pdbsearch.search(limit=5, sort=["-atoms", "released"])
116-
['1ANP', '6UOU', '6UOW', '1Q7O', '6QTF']
327+
For this reason, a simplified form of the schema (all attributes, but only the
328+
information about them pdbsearch needs) is hardcoded into the library. To ensure
329+
the library always uses the most up to date information, it will try to update
330+
its own local copy of the schema from the RCSB API when the library is imported.
117331

118-
Search Criteria
119-
~~~~~~~~~~~~~~~
332+
This can be disabled by setting the ``PDBSEARCH_NO_UPDATE`` environment variable.
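
Since the update happens at import time, the variable has to be set before the library is imported. The sketch below shows one way to do that from Python. The value ``"1"`` is an assumption, as the text above only says the variable must be set, and the import is wrapped so that the snippet also runs where pdbsearch is not installed.

```python
import os

# Set the variable before importing the library; the value "1" is an
# assumption, the README only states that the variable must be set.
os.environ["PDBSEARCH_NO_UPDATE"] = "1"

try:
    import pdbsearch  # the schema update should now be skipped on import
except ImportError:
    pdbsearch = None  # library not installed in this environment

print(os.environ["PDBSEARCH_NO_UPDATE"])
```

Setting the variable in the shell before starting Python would have the same effect.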
120333

121-
You can search by passing keywords to the search function:
334+
You can also download the full schema using a CLI command::
122335

123-
>>> pdbsearch.search(limit=5, ligand_name="ZN")
124-
['3HW7', '3I7I', '3I7G', '2WFX', '2WGT']
336+
pdbsearch schema > schema.json
337+
pdbsearch schema --chemical --indent 4 > chemical_schema.json
125338

126-
You can modify the operator used with double underscores:
339+
The downloaded schema is cached locally so that pdbsearch does not fetch it
340+
every time it runs. To delete this local cache, run::
127341

128-
>>> pdbsearch.search(limit=5, ligand_name__in=["ZN", "CU"])
129-
['3HW7', '3I7I', '3I7G', '2WFX', '2WGT']
130-
>>> pdbsearch.search(limit=5, resolution__lt=2)
131-
['3HW3', '3I83', '3HVS', '3HW4', '3HW5']
132-
>>> pdbsearch.search(limit=5, atoms__within=[200, 300])
133-
['2WH9', '2WPY', '395D', '396D', '2X8Q']
342+
pdbsearch clearschema
134343

135-
These are some shorthands, but you can search by any of the terms in the above
136-
linked list by replacing the dot with a double underscore:
344+
Changelog
345+
---------
137346

138-
>>> pdbsearch.search(limit=5, citation__rcsb_authors="Sula, A.")
139-
['4CAH', '4CAI', '4X8A', '4X88', '4X89']
347+
Release 0.5.0
348+
~~~~~~~~~~~~~
140349

141-
If you use more than one term, they will be combined with AND operators:
350+
`8 Jan 2026`
142351

143-
>>> pdbsearch.search(limit=5, ligand_name="ZN", atoms__within=[200, 300])
144-
['3WUP', '3ZNF', '2YTA', '2YTB', '2YSV']
352+
* Overhauled library for new RCSB search API structure.
145353

146-
Changelog
147-
---------
148354

149355
Release 0.4.0
150356
~~~~~~~~~~~~~
