Upper bound of 10000 queries means I can't access the entirety of INSPIRE institutions #20

Open
smeehan12 opened this issue Dec 17, 2021 · 3 comments

Comments

@smeehan12

I am trying to use the API to collect geographical information about publications worldwide, in order to understand how publication output differs between institutions in different regions. To do so, I am making calls to URLs like

https://inspirehep.net/api/institutions?sort=mostrecent&size=1&page=1

which lets me query the metadata associated with the institutional publication records.

This works well and allows me to get all the information I need. However, there seems to be an upper limit on being able to access all of the data because when I try a call like

https://inspirehep.net/api/institutions?sort=mostrecent&size=10&page=1001

I get a return of

{"status": 400, "message": "Maximum number of 10000 results have been reached."}

Now, I see that at most 1000 results can be requested in a single call, but this upper bound of 10000 is a problem because it means I can't access the data for the full set of 11791 institutions that have publications in HEP via this API.

Is there some reason why this upper bound exists? Or am I misusing the API?

@michamos
Contributor

michamos commented Dec 20, 2021

You're doing everything right; this is an unfortunate limitation on our side (Elasticsearch is used as the search engine, and the API we're using for pagination has a limit of 10000 results). I hope we can improve this soon by switching to a different pagination mechanism, but in the meantime you can use the following workaround.

Add to the search query (which is empty in your case) an additional filter that ensures a single search returns fewer than 10000 results, then manually step through the values you're filtering on. A range of control_number values is convenient for this, as all records are guaranteed to contain exactly one control_number. For Institutions (which uses the standard ES query_string parser), this would look like

https://inspirehep.net/api/institutions?sort=mostrecent&size=1&page=1&q=control_number%3A[1 TO 1000000]
https://inspirehep.net/api/institutions?sort=mostrecent&size=1&page=1&q=control_number%3A[1000001 TO 2000000]
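The two URLs above can also be generated programmatically. The following is only a minimal sketch of that idea, not an official client: the helper names `control_number_ranges` and `build_url`, the window width, and the upper control_number in the usage note are illustrative assumptions. Only the query parameters and the range syntax come from the URLs above.

```python
from urllib.parse import urlencode

BASE_URL = "https://inspirehep.net/api/institutions"


def control_number_ranges(start, stop, step):
    """Yield inclusive (low, high) windows that cover start..stop."""
    low = start
    while low <= stop:
        high = min(low + step - 1, stop)
        yield low, high
        low = high + 1


def build_url(low, high, size=1000, page=1):
    """Build an institutions search URL restricted to one control_number window.

    The function name and defaults are hypothetical; the parameters
    (sort, size, page, q) are the ones used in the URLs above.
    """
    params = {
        "sort": "mostrecent",
        "size": size,
        "page": page,
        # query_string range syntax, as in the workaround above
        "q": f"control_number:[{low} TO {high}]",
    }
    # urlencode percent-encodes the colon and brackets and writes spaces as "+"
    return f"{BASE_URL}?{urlencode(params)}"
```

For example, `[build_url(lo, hi) for lo, hi in control_number_ranges(1, 2_000_000, 1_000_000)]` produces one URL per window. The upper bound used there is an arbitrary placeholder; each window still has to be narrow enough to match fewer than 10000 records.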

For Literature, which has a much higher density of records and uses a custom query parser, you'd do something like

https://inspirehep.net/api/literature?sort=mostrecent&size=1&page=1&q=control_number%3A1->10000
https://inspirehep.net/api/literature?sort=mostrecent&size=1&page=1&q=control_number%3A10001->20000
[etc.]
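Within one such window you would still page through the results. Below is a rough sketch of that loop for Literature, under a few assumptions that are not stated above: that the JSON response keeps its records under `hits.hits`, that a page shorter than `size` means the window is exhausted, and that the helper names (`fetch_json`, `iter_window`) and the default page size are free choices made for illustration.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

LITERATURE_URL = "https://inspirehep.net/api/literature"
MAX_RESULTS = 10000  # pagination limit reported in this issue


def fetch_json(url):
    """Download a URL and parse the body as JSON."""
    with urlopen(url) as response:
        return json.load(response)


def iter_window(low, high, size=250, fetch=fetch_json):
    """Yield every record whose control_number lies in low..high.

    Assumes the response stores records under payload["hits"]["hits"].
    `fetch` can be swapped out, e.g. for testing without network access.
    """
    page = 1
    while True:
        params = {
            "sort": "mostrecent",
            "size": size,
            "page": page,
            # Literature range syntax, as in the URLs above
            "q": f"control_number:{low}->{high}",
        }
        payload = fetch(f"{LITERATURE_URL}?{urlencode(params)}")
        hits = payload["hits"]["hits"]
        yield from hits
        # Stop on a short (or empty) page, or before the next page
        # would go past the 10000-result limit.
        if len(hits) < size or page * size >= MAX_RESULTS:
            break
        page += 1
```

Calling `list(iter_window(1, 10000))` would then collect the first window, and the same call with `10001, 20000` the next one.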

@michamos michamos pinned this issue Dec 20, 2021
michamos added a commit that referenced this issue Dec 20, 2021
@smeehan12
Author

What you write here works very well. Thank you for adding it to the documentation. I think it will make clear how to work around this limit if someone starts using the API for their own project and runs into it.

Please keep up the great work on developing this infrastructure. It is crucial for meta-analyses, and I look forward to sharing the results of our project with you when they come to fruition!

@javadebadi
Contributor

Hi @michamos
It would be nice to be able to get the list of all available control_numbers or record ids.
If that were possible, the 10000 upper bound would probably not be a serious problem for now.
I think that if you add the following new routes, it will be very beneficial to users:

  1. a route to get a list of all author ids in Inspirehep (/api/authors/ids/)
  2. a route to get a list of all institution ids in Inspirehep (/api/institutions/ids)
  3. a route to get a list of all literature ids in Inspirehep (/api/literature/ids)
  4. a route to get a list of all seminar ids in Inspirehep (/api/seminars/ids)
  5. a route to get a list of all job ids in Inspirehep (/api/jobs/ids/)
  6. a route to get a list of all conference ids in Inspirehep (/api/conferences/ids)
  7. a route to get a list of all experiment ids in Inspirehep (/api/experiments/ids)

There could probably be a natural ordering for each of these object types, so that users could request the top 10, 1000, etc.
items in the list by specifying a query parameter.
