11pdbsearch
22=========
33
4- |travis| |coveralls| |pypi| |version| |commit|
4+ |ci| |version| |pypi| |license| |commit|
55
6- .. |travis| image:: https://api.travis-ci.org/samirelanduk/pdbsearch.svg?branch=master
7- :target: https://travis-ci.org/samirelanduk/pdbsearch/
6+ .. |ci| image:: https://github.com/samirelanduk/pdbsearch/actions/workflows/main.yml/badge.svg
7+ :target: https://github.com/samirelanduk/pdbsearch/actions/workflows/main.yml
88
9- .. |coveralls| image:: https://coveralls.io/repos/github/samirelanduk/pdbsearch/badge.svg?branch=master
10- :target: https://coveralls.io/github/samirelanduk/pdbsearch/
9+ .. |version| image:: https://img.shields.io/pypi/v/pdbsearch.svg
10+ :target: https://pypi.org/project/pdbsearch/
1111
1212.. |pypi| image:: https://img.shields.io/pypi/pyversions/pdbsearch.svg
1313 :target: https://pypi.org/project/pdbsearch/
1414
15- .. |version| image:: https://img.shields.io/pypi/v/pdbsearch.svg
16- :target: https://pypi.org/project/pdbsearch/
15+ .. |license| image:: https://img.shields.io/pypi/l/pdbsearch.svg?color=blue
16+ :target: https://github.com/samirelanduk/pdbsearch/blob/master/LICENSE
1717
1818.. |commit| image:: https://img.shields.io/github/last-commit/samirelanduk/pdbsearch/master.svg
1919 :target: https://github.com/samirelanduk/pdbsearch/tree/master/
@@ -25,9 +25,13 @@ Example
2525-------
2626
2727 >>> import pdbsearch
28- >>> codes = pdbsearch.search(limit = 5 , ligand_name = " CU" )
29- >>> codes
30- ['3HW7', '2WKO', '2WOF', '2WOH', '2WO0']
28+ >>> results = pdbsearch.search(rows=5, chem_comp__name__contains="zinc")
29+ >>> print(results["total_count"])
30+ 26
31+ >>> print(results["result_set"])
32+ [{'identifier': '1A0B', 'score': 1.0}, {'identifier': '1A1F', 'score': 1.0},
33+ {'identifier': '1A1G', 'score': 1.0}, {'identifier': '1A1H', 'score': 1.0},
34+ {'identifier': '1A1I', 'score': 1.0}]
3135
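Because the search returns the raw JSON rather than a plain list of codes, a common first step is to pull out just the identifiers. The following is a minimal sketch that uses a hard-coded dictionary shaped like the output above instead of a live call; ``identifiers`` is a hypothetical helper and not part of pdbsearch.

```python
def identifiers(results):
    """Return the identifier of every hit in a search response dictionary."""
    # Treating a missing "result_set" key as "no hits" is an assumption.
    return [hit["identifier"] for hit in results.get("result_set", [])]


# Stand-in for the dictionary returned by pdbsearch.search(...)
sample = {
    "total_count": 26,
    "result_set": [
        {"identifier": "1A0B", "score": 1.0},
        {"identifier": "1A1F", "score": 1.0},
    ],
}

print(identifiers(sample))
```

With a real response, the same helper would be called as ``identifiers(pdbsearch.search(rows=5, chem_comp__name__contains="zinc"))``.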
3236
3337Installing
@@ -45,16 +49,6 @@ If you get permission errors, try using ``sudo``:
4549``$ sudo pip install pdbsearch``
4650
4751
48- Development
49- ~~~~~~~~~~~
50-
51- The repository for pdbsearch, containing the most recent iteration, can be
52- found `here <http://github.com/samirelanduk/pdbsearch/ >`_. To clone the
53- pdbsearch repository directly from there, use:
54-
55- ``$ git clone git://github.com/samirelanduk/pdbsearch.git ``
56-
57-
5852Requirements
5953~~~~~~~~~~~~
6054
@@ -80,71 +74,283 @@ Overview
8074pdbsearch is a Python library for searching for PDB structures using the
8175RCSB web services.
8276
83- Returning all PDB Codes
84- ~~~~~~~~~~~~~~~~~~~~~~~
77+ Basic Search
78+ ~~~~~~~~~~~~
8579
86- You can get all PDB codes without any particular search expression like so:
80+ The default search will return PDB entry IDs, with no filtering:
8781
8882 >>> import pdbsearch
89- >>> codes = pdbsearch.search(limit = None )
90- >>> len (codes)
91- 174994
92-
93- This will take a few seconds, and requires downloading a rather large JSON
94- object over the network. Generally it is better to paginate the results:
95-
96- >>> first_ten_codes = pdbsearch.search(limit = 10 )
97- >>> second_ten_codes = pdbsearch.search(start = 10 , limit = 10 )
98- >>> third_ten_codes = pdbsearch.search(start = 20 , limit = 10 )
99-
100- You can sort the results by any of the terms at
101- `<https://search.rcsb.org/structure-search-attributes.html >`_:
102-
103- >>> most_recent_codes = pdbsearch.search(sort = " rcsb_accession_info.deposit_date" )
104- >>> earliest_codes = pdbsearch.search(sort = " -rcsb_accession_info.deposit_date" )
105-
106- As these are somewhat cumbersome, some of them have a shorthand:
83+ >>> results = pdbsearch.search()
84+ >>> print(results)
85+ {'query_id': '9a0c3543-0e29-462c-8357-f286293d9896', 'result_type': 'entry',
86+ 'total_count': 247417, 'result_set': [{'identifier': '100D', 'score': 1.0},
87+ {'identifier': '101D', 'score': 1.0}, {'identifier': '101M', 'score': 1.0},
88+ {'identifier': '102D', 'score': 1.0}, {'identifier': '102L', 'score': 1.0},
89+ {'identifier': '102M', 'score': 1.0}, {'identifier': '103D', 'score': 1.0},
90+ {'identifier': '103L', 'score': 1.0}, {'identifier': '103M', 'score': 1.0},
91+ {'identifier': '104D', 'score': 1.0}]}
92+
93+ The JSON returned is the direct output from the RCSB search API. By default, it
94+ returns 10 results at a time.
95+
96+ Services
97+ ########
98+
99+ RCSB defines a number of services which search in different ways. For example,
100+ the full text search service will search the entire PDB database for the given
101+ term, and you can access this with the ``term`` keyword argument:
102+
103+ >>> results = pdbsearch.search(term="thymidine kinase")
104+
105+ You can search for entries with other services, using the correct keyword
106+ arguments:
107+
108+ >>> # Sequence service
109+ >>> results = pdbsearch.search(protein="MALWMRLLPLLALLALWGPDPAAA")
110+ >>> results = pdbsearch.search(dna="ATGC", identity=0.95, evalue=1e-10)
111+ >>> results = pdbsearch.search(rna="AUGC")
112+ >>>
113+ >>> # Sequence motif service
114+ >>> results = pdbsearch.search(protein="C-X-C-X(2)-[LIVMYFWC]", pattern_type="prosite")
115+ >>> results = pdbsearch.search(dna="GTXXCA", pattern_type="simple")
116+ >>> results = pdbsearch.search(rna="AUG{2}C", pattern_type="regex")
117+ >>>
118+ >>> # Structure service (requires the ID of a specific assembly to look for)
119+ >>> results = pdbsearch.search(structure="1A2B-2", operator="relaxed_shape_match")
120+ >>>
121+ >>> # Structure motif service (requires a residue pattern in some entry)
122+ >>> results = pdbsearch.search(entry="1A2B", residues=(("A", 1), ("B", 2)))
123+ >>>
124+ >>> # Chemical service
125+ >>> results = pdbsearch.search(smiles="CC(C)C", match_type="graph-relaxed-stereo")
126+ >>> results = pdbsearch.search(inchi="InChI=1S/C6H12/c1-2-4-6-5-3-1/h1-6H2")
127+
128+ The most useful service, however, is the text service. The RCSB search API
129+ documentation lists the attributes you can search on, which can be viewed
130+ `here <https://search.rcsb.org/structure-search-attributes.html>`_, for
131+ example ``pdbx_entity_nonpoly.name`` or
132+ ``rcsb_nonpolymer_entity.pdbx_number_of_molecules``. You pass these as keyword
133+ arguments to the ``pdbsearch.search`` function, with a double underscore
134+ (``__``) in place of each dot.
135+
136+ >>> results = pdbsearch.search(pdbx_entity_nonpoly__name="glucose")
137+ >>> results = pdbsearch.search(rcsb_nonpolymer_entity__pdbx_number_of_molecules=0)
138+
139+ As the `RCSB documentation <https://search.rcsb.org/#search-api>`_ indicates, you can use a variety of operators to
140+ modify how you search, and these are encoded as suffixes to the keyword
141+ arguments:
142+
143+ - ``__gt``: ``greater``
144+ - ``__lt``: ``less``
145+ - ``__gte``: ``greater_or_equal``
146+ - ``__lte``: ``less_or_equal``
147+ - ``__in``: ``in``
148+ - ``__exists``: ``exists``
149+ - ``__range``: ``range`` (use ``tuple`` for exclusive range, ``list`` for inclusive range)
150+ - ``__contains``: ``contains_phrase``
151+ - ``__contains_phrase``: ``contains_phrase``
152+ - ``__contains_words``: ``contains_words``
153+ - ``__not``: ``not``
154+
155+ The ``__not`` suffix negates a search, and can be followed by another operator suffix (e.g. ``__not__contains``).
156+
157+ >>> results = pdbsearch.search(pdbx_entity_nonpoly__name__not="glucose")
158+ >>> results = pdbsearch.search(rcsb_nonpolymer_entity__pdbx_number_of_molecules__not=0)
159+
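When attribute names come from the RCSB attribute list, it can be convenient to build the keyword names programmatically. The sketch below is a hypothetical helper (``keyword_for`` is not part of pdbsearch) that applies the naming rules described above: dots become double underscores, then an optional ``__not``, then an optional operator suffix. The ordering of ``__not`` before the operator is inferred from the examples in this README.

```python
def keyword_for(attribute, operator=None, negate=False):
    """Build a keyword-argument name from a dotted RCSB attribute name."""
    name = attribute.replace(".", "__")  # dots become double underscores
    if negate:
        name += "__not"                  # negation precedes the operator
    if operator:
        name += "__" + operator          # e.g. "lt", "contains", "in"
    return name


kwargs = {
    keyword_for("pdbx_entity_nonpoly.name"): "glucose",
    keyword_for("chem_comp.formula_weight", operator="lt"): 1000,
    keyword_for("pdbx_struct_assembly.details", operator="contains", negate=True): "good",
}

for name in kwargs:
    print(name)
```

The resulting dictionary could then be unpacked into a call such as ``pdbsearch.search(**kwargs)``.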
160+ There is a very similar search service called the ``text_chem`` service, which
161+ has a different set of `attributes <https://search.rcsb.org/chemical-search-attributes.html>`_
162+ (in practice a subset of the structure attributes) but which works the
163+ same way. It searches properties of chemical compounds, such as formula weight.
164+ The ``search`` function will default to using this service if you are searching
165+ for small molecules.
166+
167+ >>> results = pdbsearch.search(return_type="mol_definition", chem_comp__formula_weight__lt=1000)
168+
169+ Return Types
170+ ############
171+
172+ The above examples all search for entries (i.e. PDB files) but the RCSB API
173+ also lets you search for other types of objects, such as polymer entities,
174+ non-polymer entities, and chemical compounds.
175+
176+ The type of object you search for is determined by the ``return_type`` parameter.
177+ The possible values are: ``entry``, ``assembly``, ``polymer_entity``,
178+ ``non_polymer_entity``, ``polymer_instance``, ``mol_definition``. Where the term
179+ you are searching for does not correspond to the type of object you are
180+ searching for (e.g. doing a sequence search but asking for non-polymer
181+ entities), the API will use entries as a base (i.e. finding all entries with a
182+ polymer entity matching your sequence, and then returning all non-polymer
183+ entities in those entries).
184+
185+ Alternatively, there are specific functions for searching for each of these types
186+ of objects - ``pdbsearch.search_entries``, ``pdbsearch.search_assemblies``, etc.
187+
188+ Multiple Queries
189+ ################
190+
191+ You can combine any of the above parameters to search multiple services at once.
192+ These will be combined with an ``and`` operator.
193+
194+ >>> results = pdbsearch.search(
195+ return_type="polymer_entity",
196+ term="thymidine kinase",
197+ chem_comp__formula_weight__lt=1000,
198+ pdbx_struct_assembly__details__not__contains="good",
199+ protein="MALWMRLLPLLALLALWGPDPAAA",
200+ dna="ATGC",
201+ rna="AUGC",
202+ identity=0.95,
203+ evalue=1e-10,
204+ structure="1A2B-2",
205+ operator="relaxed_shape_match",
206+ entry="1A2B",
207+ residues=(("A", 1), ("B", 2)),
208+ rmsd=0.5,
209+ exchanges={("A", 1): ["ASP"], ("B", 2): ["HIS"]},
210+ smiles="CC(C)C",
211+ inchi="InChI=1S/C6H12/c1-2-4-6-5-3-1/h1-6H2",
212+ match_type="graph-relaxed-stereo",
213+ )
214+
215+ Request Options
216+ ################
217+
218+ You can control how the search is performed and returned with various request
219+ option parameters:
220+
221+ >>> # Return all results in one response
222+ >>> results = pdbsearch.search(term="thymidine kinase", return_all=True)
223+ >>>
224+ >>> # Return only the count of results
225+ >>> results = pdbsearch.search(term="thymidine kinase", counts_only=True)
226+ >>>
227+ >>> # Return results starting from the 10th result
228+ >>> results = pdbsearch.search(term="thymidine kinase", start=10)
229+ >>>
230+ >>> # Return 20 results at a time
231+ >>> results = pdbsearch.search(term="thymidine kinase", rows=20)
232+ >>>
233+ >>> # Sort results by the deposit date (descending)
234+ >>> results = pdbsearch.search(term="thymidine kinase", sort="-rcsb_accession_info.deposit_date")
235+ >>>
236+ >>> # Sort results by the polymer entity count (ascending)
237+ >>> results = pdbsearch.search(term="thymidine kinase", sort="rcsb_assembly_info.polymer_entity_count")
238+ >>>
239+ >>> # Sort results by multiple attributes
240+ >>> results = pdbsearch.search(term="thymidine kinase", sort=["-rcsb_accession_info.deposit_date", "rcsb_assembly_info.polymer_entity_count"])
241+ >>>
242+ >>> # Return results with computational content type only
243+ >>> results = pdbsearch.search(term="thymidine kinase", content_types=["computational"])
244+ >>>
245+ >>> # Use the API's facets functionality (see their documentation for more details)
246+ >>> results = pdbsearch.search(term="thymidine kinase", facets=[...])
247+
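When ``return_all`` is not practical, the ``start`` and ``rows`` options can be used to walk through results one page at a time. The sketch below shows one way this might be done. ``iter_identifiers`` is a hypothetical helper, the search function is passed in as an argument so the loop can be tried without network access, and ``start`` is assumed to be a zero-based offset. ``fake_search`` is a stand-in that only mimics the shape of the response.

```python
def iter_identifiers(search, page_size=100, **criteria):
    """Yield identifiers page by page until the result set is exhausted."""
    start = 0
    while True:
        # Assumes the response is a dict with "result_set" and "total_count".
        results = search(start=start, rows=page_size, **criteria) or {}
        hits = results.get("result_set", [])
        if not hits:
            break
        for hit in hits:
            yield hit["identifier"]
        start += len(hits)
        if start >= results.get("total_count", 0):
            break


def fake_search(start=0, rows=10, **criteria):
    """Stand-in for pdbsearch.search that pages over five fixed codes."""
    codes = ["100D", "101D", "101M", "102D", "102L"]
    page = codes[start:start + rows]
    return {
        "total_count": len(codes),
        "result_set": [{"identifier": code, "score": 1.0} for code in page],
    }


print(list(iter_identifiers(fake_search, page_size=2)))
```

Against the real API this would be called as ``iter_identifiers(pdbsearch.search, term="thymidine kinase")``. Each page is a separate request, so ``return_all=True`` remains the simpler choice when the full result set is small.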
248+ Nodes and Queries
249+ ~~~~~~~~~~~~~~~~~
250+
251+ The ``pdbsearch.search`` function is useful for simple queries, but it has some limitations.
252+
253+ 1. If using multiple queries, they will always be combined with an ``and`` operator.
254+ 2. You can only provide one value per argument - if you have multiple protein sequences, you can't search for them all at once.
255+
256+ It can sometimes be useful to access the underlying node system that
257+ ``pdbsearch.search`` is built on for more complex queries. This solves
258+ both of the above limitations.
259+
260+ Nodes
261+ #####
262+
263+ Each of the search services has a function for creating a single search node.
264+
265+ >>> # Full text search node
266+ >>> node = pdbsearch.full_text_node(term="thymidine kinase")
267+ >>>
268+ >>> # Text search node
269+ >>> node = pdbsearch.text_node(pdbx_struct_assembly__details__not__contains="good")
270+ >>>
271+ >>> # Text chem search node
272+ >>> node = pdbsearch.text_chem_node(chem_comp__formula_weight__lt=1000)
273+ >>>
274+ >>> # Sequence search nodes
275+ >>> node = pdbsearch.sequence_node(protein="MALWMRLLPLLALLALWGPDPAAA", identity=0.95, evalue=1e-10)
276+ >>> node = pdbsearch.sequence_node(dna="ATGC", identity=0.95, evalue=1e-10)
277+ >>> node = pdbsearch.sequence_node(rna="AUGC", identity=0.95, evalue=1e-10)
278+ >>>
279+ >>> # Sequence motif search node
280+ >>> node = pdbsearch.seqmotif_node(protein="C-X-C-X(2)-[LIVMYFWC]", pattern_type="prosite")
281+ >>>
282+ >>> # Structure search node
283+ >>> node = pdbsearch.structure_node("1A2B-2", operator="relaxed_shape_match")
284+ >>>
285+ >>> # Structure motif search node
286+ >>> node = pdbsearch.strucmotif_node("1A2B", residues=(("A", 1), ("B", 2)), rmsd=0.5, exchanges={("A", 1): ["ASP"], ("B", 2): ["HIS"]})
287+ >>>
288+ >>> # Chemical search nodes
289+ >>> node = pdbsearch.chemical_node(smiles="CC(C)C", match_type="graph-relaxed-stereo")
290+ >>> node = pdbsearch.chemical_node(inchi="InChI=1S/C6H12/c1-2-4-6-5-3-1/h1-6H2", match_type="graph-relaxed-stereo")
291+
292+ You can execute any of these nodes individually using their
293+ ``query`` method. It takes a ``return_type`` argument,
294+ and all of the request option parameters.
295+
296+ >>> results = node.query("entry", return_all=True, sort="-rcsb_accession_info.deposit_date")
297+
298+ Combining Nodes
299+ ###############
300+
301+ All node objects have ``and_`` and ``or_`` methods, which can be used to combine
302+ them with other nodes.
303+
304+ >>> node1 = pdbsearch.full_text_node(term="thymidine kinase")
305+ >>> node2 = pdbsearch.text_node(pdbx_struct_assembly__details__not__contains="good")
306+ >>> node3 = pdbsearch.sequence_node(protein="MALWMRLLPLLALLALWGPDPAAA", identity=0.95, evalue=1e-10)
307+ >>> node4 = pdbsearch.sequence_node(dna="ATGC", identity=0.95, evalue=1e-10)
308+ >>> node5 = pdbsearch.sequence_node(rna="AUGC", identity=0.95, evalue=1e-10)
309+ >>> node = node1.and_(node2).or_(node3.and_(node4.or_(node5)))
310+ >>> results = node.query("entry", return_all=True, sort="-rcsb_accession_info.deposit_date")
311+
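The second limitation mentioned earlier (one value per argument) can be worked around by building one node per value and folding them together. The sketch below is illustrative only: ``any_of`` is a hypothetical helper, and ``FakeNode`` is a stand-in used so the folding logic can be run without the library. It assumes, as the chained example above suggests, that ``or_`` returns a new combined node.

```python
from functools import reduce


def any_of(nodes):
    """Combine nodes left to right using each node's or_ method."""
    nodes = list(nodes)
    if not nodes:
        raise ValueError("at least one node is required")
    return reduce(lambda combined, node: combined.or_(node), nodes)


class FakeNode:
    """Stand-in that records how it was combined (not part of pdbsearch)."""

    def __init__(self, label):
        self.label = label

    def or_(self, other):
        return FakeNode(f"({self.label} OR {other.label})")


combined = any_of(FakeNode(name) for name in ["seqA", "seqB", "seqC"])
print(combined.label)
```

With real nodes, the same helper might be used as ``any_of(pdbsearch.sequence_node(protein=s) for s in sequences).query("entry")``, where ``sequences`` is your own list of protein sequences.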
312+ Schemas
313+ ~~~~~~~
107314
108- >>> pdbsearch.search(limit = 5 , sort = " code" )
109- ['9XIM', '9XIA', '9WGA', '9RUB', '9RSA']
110- >>> pdbsearch.search(limit = 5 , sort = " -resolution" )
111- ['3NIR', '5D8V', '1EJG', '3P4J', '5NW3']
315+ The text and text_chem services have a schema that defines the attributes you can
316+ search on. These can be read `here <https://search.rcsb.org/structure-search-attributes.html>`_
317+ and `here <https://search.rcsb.org/chemical-search-attributes.html>`_ respectively.
112318

113- You can sort by multiple criteria:
319+ They are also available as JSON schema objects,
320+ `here <https://search.rcsb.org/rcsbsearch/v2/metadata/schema>`_ and
321+ `here <https://search.rcsb.org/rcsbsearch/v2/metadata/chemical/schema>`_
322+ respectively. This is important as pdbsearch needs to know type information
323+ about the attributes in order to know which operator to use sometimes, and it
324+ needs to know which parameter names correspond to this service when parsing a
325+ ``pdbsearch.search`` function call.
114326
115- >>> pdbsearch.search(limit = 5 , sort = [" -atoms" , " released" ])
116- ['1ANP', '6UOU', '6UOW', '1Q7O', '6QTF']
327+ For this reason, a simplified form of the schema (all attributes, but only the
328+ information about them pdbsearch needs) is hardcoded into the library. To ensure
329+ the library always uses the most up to date information, it will try to update
330+ its own local copy of the schema from the RCSB API when the library is imported.
117331
118- Search Criteria
119- ~~~~~~~~~~~~~~~
332+ This can be disabled by setting the ``PDBSEARCH_NO_UPDATE`` environment variable.
120333
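Because the update is attempted at import time, the variable has to be set before pdbsearch is first imported. A minimal sketch of doing this from Python follows. The value ``"1"`` is an assumption, as the text above only says the variable needs to be set, and the import is guarded so the snippet also runs where the library is not installed.

```python
import os

# Must run before the first "import pdbsearch" in the process.
# The value "1" is an assumption; the README only says the variable must be set.
os.environ["PDBSEARCH_NO_UPDATE"] = "1"

try:
    import pdbsearch  # noqa: F401
except ImportError:
    pdbsearch = None  # library not installed here; nothing to configure

print(os.environ["PDBSEARCH_NO_UPDATE"])
```

Setting the variable in the shell before starting Python achieves the same thing and avoids any dependence on import order.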
121- You can search by passing keywords to the search function :
334+ You can also download the full schema using a CLI command::
122335
123- >>> pdbsearch.search( limit = 5 , ligand_name = " ZN " )
124- ['3HW7', '3I7I', '3I7G', '2WFX', '2WGT']
336+ pdbsearch schema > schema.json
337+ pdbsearch schema --chemical --indent 4 > chemical_schema.json
125338
126- You can modify the operator used with double underscores:
339+ The downloaded schema information will be cached locally, so that it doesn't fetch
340+ the schema every time pdbsearch runs - to delete this local cache, you can run::
127341
128- >>> pdbsearch.search(limit = 5 , ligand_name__in = [" ZN" , " CU" ])
129- ['3HW7', '3I7I', '3I7G', '2WFX', '2WGT']
130- >>> pdbsearch.search(limit = 5 , resolution__lt = 2 )
131- ['3HW3', '3I83', '3HVS', '3HW4', '3HW5']
132- >>> pdbsearch.search(limit = 5 , atoms__within = [200 , 300 ])
133- ['2WH9', '2WPY', '395D', '396D', '2X8Q']
342+ pdbsearch clearschema
134343
135- These are some shorthands, but you can search by any of the terms in the above
136- linked list by replacing the dot with a double underscore:
344+ Changelog
345+ ---------
137346
138- >>> pdbsearch.search( limit = 5 , citation__rcsb_authors = " Sula, A. " )
139- ['4CAH', '4CAI', '4X8A', '4X88', '4X89']
347+ Release 0.5.0
348+ ~~~~~~~~~~~~~
140349
141- If you use more than one term, they will be combined with AND operators:
350+ `8 Jan 2026`
142351
143- >>> pdbsearch.search(limit = 5 , ligand_name = " ZN" , atoms__within = [200 , 300 ])
144- ['3WUP', '3ZNF', '2YTA', '2YTB', '2YSV']
352+ * Overhauled library for new RCSB search API structure.
145353
146- Changelog
147- ---------
148354
149355Release 0.4.0
150356~~~~~~~~~~~~~