Occasional very slow response times, that could only be fixed by service restart (Windows) #1319

Closed
arekdygas opened this issue May 9, 2018 · 5 comments

@arekdygas

Current Behavior

We have CouchDB (with dreyfus/Clouseau) running on Windows 2012 R2. Lately we have observed that, after some period of time, requests that are normally handled in milliseconds take much longer to complete. For example, running this series of requests:

  • GET db info
  • GET all design documents in db (_all_docs view query)
  • GET info about all design documents in db (about 70)

took over 1 minute, whereas it normally takes 1-2 seconds. After a restart, everything works as expected.
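To put numbers on the slowdown, the request series above could be timed with a short script. This is only a minimal sketch: the function names are made up for illustration, the base URL and database name are placeholders, authentication is not handled, and design document names are assumed not to need URL escaping.

```python
import json
import time
import urllib.parse
import urllib.request


def design_docs_path(db):
    """Path for an _all_docs range query that returns only design documents."""
    query = urllib.parse.urlencode({
        "startkey": json.dumps("_design/"),
        "endkey": json.dumps("_design0"),
    })
    return "/{}/_all_docs?{}".format(urllib.parse.quote(db, safe=""), query)


def timed_get(base_url, path, timeout=120):
    """GET base_url + path; return (elapsed_seconds, decoded JSON body)."""
    start = time.perf_counter()
    with urllib.request.urlopen(base_url + path, timeout=timeout) as resp:
        body = json.load(resp)
    return time.perf_counter() - start, body


def probe(base_url, db):
    """Run the three-step series; return a dict of step name -> seconds."""
    db_path = "/" + urllib.parse.quote(db, safe="")
    timings = {}
    # Step 1: database info.
    timings["db_info"], _ = timed_get(base_url, db_path)
    # Step 2: list all design documents via _all_docs.
    timings["design_docs"], listing = timed_get(base_url, design_docs_path(db))
    # Step 3: info for each design document (ids look like "_design/name").
    total = 0.0
    for row in listing["rows"]:
        elapsed, _ = timed_get(base_url, "{}/{}/_info".format(db_path, row["id"]))
        total += elapsed
    timings["design_doc_info"] = total
    return timings
```

Calling something like `probe("http://localhost:5984", "mydb")` at intervals and logging the result would show whether the degradation is gradual or sudden. Note that `urllib.request.urlopen` opens a fresh connection for every call.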

When the problem occurs, we see no db activity such as indexing or compaction. Windows performance counters indicate normal system resource usage (CPU about 5-10%, memory about 30%, no heavy disk usage). There are no errors or warnings in the CouchDB/Clouseau logs, and no errors in the Windows event logs.

Sometimes this happens after CouchDB has been running for a week or two, but recently it happened about 18 hours after the last restart. Unfortunately, when the problem occurs, we have to restart the server ASAP and have no time for a real investigation.

Do you have any idea what might be happening? Any hints about what we should check or modify in the config files?

Steps to Reproduce (for bugs)

No idea.

Your Environment

  • Version used: 2.1.1 built using couchdb-glazier, with Lucene full text search (dreyfus and clouseau), single node, single shard
  • Operating System and version (desktop or mobile): Windows Server 2012 R2 (virtual machine with 16GB RAM, 4 virtual processors)
  • Databases: 360 total, about 30 in real use (there's continuous replication running for these 30 databases, initiated on backup server)
@wohali
Member

wohali commented May 9, 2018

Are you using attachments? If so, how many per document, and how large is each attachment? What is the maximum size of each attachment?

If so, there is a known issue with excessive CPU/RAM usage in this scenario that will be fixed with the release of 2.2.0, see #745.

@wohali wohali added the windows label May 9, 2018
@DanKrt82

DanKrt82 commented May 10, 2018

We had the same problem on an installation of CouchDB (2.1.1) on Windows Server 2012 R2 (not virtual). We saw the problem only on DELETE requests to CouchDB.
CPU and RAM usage were both stable and low. We don't use any attachments in our environment. Unfortunately, I have this problem on a critical machine and need to solve it as soon as possible, so I cannot do much investigation or troubleshooting. I restarted the server and that didn't resolve the problem; after some time (a few hours) the problem disappeared.

@arekdygas
Author

Thank you for the reply, Joan.

Yes - we use attachments. Generally they are quite small (up to 100KB), but some of them are larger (4-5MB). Currently our application layer enforces a size limit of 10MB, but I have to check for any bigger attachments that might have been added to CouchDB before this limit was introduced.

I checked the issue you linked, and I can confirm that there are errors like that in our backup db (replication initiator). Unfortunately, they occurred two days before the actual issue, so I believe they might be unrelated.

@wohali
Member

wohali commented May 10, 2018

Please note that CouchDB on Windows is not intended as a production quality platform. It may work for you, but there are reports of performance issues that are as of yet unresolved. We know that running the exact same load on Linux on the exact same machine configuration is significantly more performant! Further, we have no resources to help investigate these problems.

@DanielMidali that sounds unrelated.

@arekdygas Can you try adding the following to your etc\vm.args file:

-ssl session_lifetime 300

Then, restart the CouchDB service.

@arekdygas
Author

It looks like the problem was not related to replication after all.

We installed another CouchDB instance yesterday to test some application changes in a separate environment. One of our clients, who was testing the changes, reported that the system was not responding to requests. As this was just a test db, we were able to experiment with it a bit.

We discovered that the problem started when our app sent hundreds of requests to CouchDB - at some point db responses became slower and slower, until they were so slow that it looked like the db was not responding at all. In the end, the direct cause of the problem was connection management in our application - instead of reusing a single connection or a connection pool, each request created a separate connection. It looks like we reached some CouchDB limit, but we weren't sure how to increase it.

We changed max_dbs_open to 5000 in default.ini and added '+Q 65536' in vm.args (this should work like ERL_MAX_PORTS, but maybe I'm wrong). We also tried the suggested '-ssl session_lifetime 300' (just in case), but unfortunately nothing helped.

Finally, we changed connection management in our application so that connections are now reused, and the problem doesn't occur anymore. This is great, but I must admit that I'd be glad to know what happened on the db side :)
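For readers hitting the same symptom, the kind of connection reuse described above can be sketched roughly as follows. This is not the application's actual code: the `ReusingClient` class and its method names are hypothetical, it uses Python's standard `http.client`, and it assumes plain HTTP on CouchDB's default port 5984 with no authentication.

```python
import http.client
import json


class ReusingClient:
    """Sends every request over one persistent HTTP/1.1 connection."""

    def __init__(self, host, port=5984, timeout=30):
        # The socket is opened lazily on the first request and then kept.
        self._conn = http.client.HTTPConnection(host, port, timeout=timeout)

    def get_json(self, path):
        """GET path and return (status, decoded JSON body)."""
        try:
            self._conn.request("GET", path)
            resp = self._conn.getresponse()
        except (http.client.HTTPException, OSError):
            # The server may have dropped an idle connection; reconnect once.
            self._conn.close()
            self._conn.request("GET", path)
            resp = self._conn.getresponse()
        # Read the whole body so the socket is free for the next request.
        return resp.status, json.loads(resp.read())

    def close(self):
        self._conn.close()
```

A single `ReusingClient` instance is not safe to share between threads; a multi-threaded application would want one client per thread or a proper connection pool.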

As for the Windows platform - I know support for this OS is not a priority for your team, but I didn't know that running CouchDB on Windows was not supported in production. We will think about changing the environment.

Thank you very much for the hints and help, Joan!

3 participants