
API speed and timeout issue while using MongoDB Backend #159

Open
shubhamcyble opened this issue May 15, 2022 · 16 comments

@shubhamcyble

Hi Team,

I've set up the server with the MongoDB backend and am using taxii2-client to test the setup.

The setup works fine when the collection holds fewer than a few thousand objects.
However, once the collection exceeds 100,000 objects, the get_objects API returns a 502 error and times out, even when the limit is restricted to 5 objects.

@slunak

slunak commented May 16, 2022

Hi all, I am able to reproduce this issue with fewer objects, however. For me it happens when the object count in the collection is around 3,000 to 4,000. Both the TAXII server and MongoDB instances have enough resources. It happens both locally and in AWS environments. About 5 minutes after the request, the following log snippet is thrown:

May 16 11:31:44 ip-100-64-0-10 medallion[810]: [medallion] [ERROR ] [2022-05-16 11:31:44,742] Exception on /trustgroup1/collections/64993447-4d7e-4f70-b94d-d7f33742ee63/objects/ [GET]
May 16 11:31:44 ip-100-64-0-10 medallion[810]: Traceback (most recent call last):
May 16 11:31:44 ip-100-64-0-10 medallion[810]: File "/usr/local/lib/python3.8/dist-packages/flask/app.py", line 2073, in wsgi_app
May 16 11:31:44 ip-100-64-0-10 medallion[810]: response = self.full_dispatch_request()
May 16 11:31:44 ip-100-64-0-10 medallion[810]: File "/usr/local/lib/python3.8/dist-packages/flask/app.py", line 1518, in full_dispatch_request
May 16 11:31:44 ip-100-64-0-10 medallion[810]: rv = self.handle_user_exception(e)
May 16 11:31:44 ip-100-64-0-10 medallion[810]: File "/usr/local/lib/python3.8/dist-packages/flask/app.py", line 1516, in full_dispatch_request
May 16 11:31:44 ip-100-64-0-10 medallion[810]: rv = self.dispatch_request()
May 16 11:31:44 ip-100-64-0-10 medallion[810]: File "/usr/local/lib/python3.8/dist-packages/flask/app.py", line 1502, in dispatch_request
May 16 11:31:44 ip-100-64-0-10 medallion[810]: return self.ensure_sync(self.view_functions[rule.endpoint])(**req.view_args)
May 16 11:31:44 ip-100-64-0-10 medallion[810]: File "/usr/local/lib/python3.8/dist-packages/flask_httpauth.py", line 172, in decorated
May 16 11:31:44 ip-100-64-0-10 medallion[810]: return self.ensure_sync(f)(*args, **kwargs)
May 16 11:31:44 ip-100-64-0-10 medallion[810]: File "/usr/local/lib/python3.8/dist-packages/medallion/views/objects.py", line 115, in get_or_add_objects
May 16 11:31:44 ip-100-64-0-10 medallion[810]: objects, headers = current_app.medallion_backend.get_objects(
May 16 11:31:44 ip-100-64-0-10 medallion[810]: File "/usr/local/lib/python3.8/dist-packages/medallion/backends/mongodb_backend.py", line 27, in api_wrapper
May 16 11:31:44 ip-100-64-0-10 medallion[810]: return func(*args, **kwargs)
May 16 11:31:44 ip-100-64-0-10 medallion[810]: File "/usr/local/lib/python3.8/dist-packages/medallion/backends/mongodb_backend.py", line 237, in get_objects
May 16 11:31:44 ip-100-64-0-10 medallion[810]: manifest_resource = self._get_object_manifest(api_root, collection_id, filter_args, allowed_filters, limit, True)
May 16 11:31:44 ip-100-64-0-10 medallion[810]: File "/usr/local/lib/python3.8/dist-packages/medallion/backends/mongodb_backend.py", line 117, in _get_object_manifest
May 16 11:31:44 ip-100-64-0-10 medallion[810]: next_id, more = self._update_record(next_id, count, internal)
May 16 11:31:44 ip-100-64-0-10 medallion[810]: File "/usr/local/lib/python3.8/dist-packages/medallion/backends/mongodb_backend.py", line 74, in _update_record
May 16 11:31:44 ip-100-64-0-10 medallion[810]: if self.pages[next_id]["skip"] >= count:
May 16 11:31:44 ip-100-64-0-10 medallion[810]: KeyError: '2721dba1-29c6-4372-bffc-6469fc974051'
May 16 11:31:44 ip-100-64-0-10 medallion[810]: 100.64.3.155 - - [16/May/2022 11:31:44] "GET /trustgroup1/collections/64993447-4d7e-4f70-b94d-d7f33742ee63/objects/?limit=5000 HTTP/1.1" 500 -
May 16 11:31:44 ip-100-64-0-10 medallion[810]: Error on request:
May 16 11:31:44 ip-100-64-0-10 medallion[810]: Traceback (most recent call last):
May 16 11:31:44 ip-100-64-0-10 medallion[810]: File "/usr/local/lib/python3.8/dist-packages/flask/app.py", line 2073, in wsgi_app
May 16 11:31:44 ip-100-64-0-10 medallion[810]: response = self.full_dispatch_request()
May 16 11:31:44 ip-100-64-0-10 medallion[810]: File "/usr/local/lib/python3.8/dist-packages/flask/app.py", line 1518, in full_dispatch_request
May 16 11:31:44 ip-100-64-0-10 medallion[810]: rv = self.handle_user_exception(e)
May 16 11:31:44 ip-100-64-0-10 medallion[810]: File "/usr/local/lib/python3.8/dist-packages/flask/app.py", line 1516, in full_dispatch_request
May 16 11:31:44 ip-100-64-0-10 medallion[810]: rv = self.dispatch_request()
May 16 11:31:44 ip-100-64-0-10 medallion[810]: File "/usr/local/lib/python3.8/dist-packages/flask/app.py", line 1502, in dispatch_request
May 16 11:31:44 ip-100-64-0-10 medallion[810]: return self.ensure_sync(self.view_functions[rule.endpoint])(**req.view_args)
May 16 11:31:44 ip-100-64-0-10 medallion[810]: File "/usr/local/lib/python3.8/dist-packages/flask_httpauth.py", line 172, in decorated
May 16 11:31:44 ip-100-64-0-10 medallion[810]: return self.ensure_sync(f)(*args, **kwargs)
May 16 11:31:44 ip-100-64-0-10 medallion[810]: File "/usr/local/lib/python3.8/dist-packages/medallion/views/objects.py", line 115, in get_or_add_objects
May 16 11:31:44 ip-100-64-0-10 medallion[810]: objects, headers = current_app.medallion_backend.get_objects(
May 16 11:31:44 ip-100-64-0-10 medallion[810]: File "/usr/local/lib/python3.8/dist-packages/medallion/backends/mongodb_backend.py", line 27, in api_wrapper
May 16 11:31:44 ip-100-64-0-10 medallion[810]: return func(*args, **kwargs)
May 16 11:31:44 ip-100-64-0-10 medallion[810]: File "/usr/local/lib/python3.8/dist-packages/medallion/backends/mongodb_backend.py", line 237, in get_objects
May 16 11:31:44 ip-100-64-0-10 medallion[810]: manifest_resource = self._get_object_manifest(api_root, collection_id, filter_args, allowed_filters, limit, True)
May 16 11:31:44 ip-100-64-0-10 medallion[810]: File "/usr/local/lib/python3.8/dist-packages/medallion/backends/mongodb_backend.py", line 117, in _get_object_manifest
May 16 11:31:44 ip-100-64-0-10 medallion[810]: next_id, more = self._update_record(next_id, count, internal)
May 16 11:31:44 ip-100-64-0-10 medallion[810]: File "/usr/local/lib/python3.8/dist-packages/medallion/backends/mongodb_backend.py", line 74, in _update_record
May 16 11:31:44 ip-100-64-0-10 medallion[810]: if self.pages[next_id]["skip"] >= count:
May 16 11:31:44 ip-100-64-0-10 medallion[810]: KeyError: '2721dba1-29c6-4372-bffc-6469fc974051'
May 16 11:31:44 ip-100-64-0-10 medallion[810]: During handling of the above exception, another exception occurred:
May 16 11:31:44 ip-100-64-0-10 medallion[810]: Traceback (most recent call last):
May 16 11:31:44 ip-100-64-0-10 medallion[810]: File "/usr/local/lib/python3.8/dist-packages/werkzeug/serving.py", line 324, in run_wsgi
May 16 11:31:44 ip-100-64-0-10 medallion[810]: execute(self.server.app)
May 16 11:31:44 ip-100-64-0-10 medallion[810]: File "/usr/local/lib/python3.8/dist-packages/werkzeug/serving.py", line 313, in execute
May 16 11:31:44 ip-100-64-0-10 medallion[810]: application_iter = app(environ, start_response)
May 16 11:31:44 ip-100-64-0-10 medallion[810]: File "/usr/local/lib/python3.8/dist-packages/flask/app.py", line 2091, in __call__
May 16 11:31:44 ip-100-64-0-10 medallion[810]: return self.wsgi_app(environ, start_response)
May 16 11:31:44 ip-100-64-0-10 medallion[810]: File "/usr/local/lib/python3.8/dist-packages/flask/app.py", line 2076, in wsgi_app
May 16 11:31:44 ip-100-64-0-10 medallion[810]: response = self.handle_exception(e)
May 16 11:31:44 ip-100-64-0-10 medallion[810]: File "/usr/local/lib/python3.8/dist-packages/flask/app.py", line 1440, in handle_exception
May 16 11:31:44 ip-100-64-0-10 medallion[810]: server_error = self.ensure_sync(handler)(server_error)
May 16 11:31:44 ip-100-64-0-10 medallion[810]: File "/usr/local/lib/python3.8/dist-packages/medallion/__init__.py", line 105, in handle_error
May 16 11:31:44 ip-100-64-0-10 medallion[810]: "description": str(error.args[0]),
May 16 11:31:44 ip-100-64-0-10 medallion[810]: IndexError: tuple index out of range

@rpiazza
Contributor

rpiazza commented May 17, 2022

Have you tried using MongoDB indexing? Here is what we do for testing:

from pymongo import ASCENDING, IndexModel

# Index the fields the MongoDB backend filters and sorts on.
id_index = IndexModel([("id", ASCENDING)])
type_index = IndexModel([("type", ASCENDING)])
collection_index = IndexModel([("_collection_id", ASCENDING)])
date_index = IndexModel([("_manifest.date_added", ASCENDING)])
version_index = IndexModel([("_manifest.version", ASCENDING)])
date_and_spec_index = IndexModel([("_manifest.media_type", ASCENDING), ("_manifest.date_added", ASCENDING)])
version_and_spec_index = IndexModel([("_manifest.media_type", ASCENDING), ("_manifest.version", ASCENDING)])
collection_and_date_index = IndexModel([("_collection_id", ASCENDING), ("_manifest.date_added", ASCENDING)])

# api_db is the API root database, e.g. client["trustgroup1"]
api_db["objects"].create_indexes(
    [id_index, type_index, date_index, version_index, collection_index, date_and_spec_index,
     version_and_spec_index, collection_and_date_index]
)

Please tell us if this helps at all.

@chisholm
Contributor

May 16 11:31:44 ip-100-64-0-10 medallion[810]: File "/usr/local/lib/python3.8/dist-packages/medallion/__init__.py", line 105, in handle_error
May 16 11:31:44 ip-100-64-0-10 medallion[810]: "description": str(error.args[0]),
May 16 11:31:44 ip-100-64-0-10 medallion[810]: IndexError: tuple index out of range

Looks like we don't get a proper error response because the error handler has a bug. I think this was fixed in PR #145, but the fix hasn't made it into a release. So there are actually two issues at play here: the original error and the error-handler error.

May 16 11:31:44 ip-100-64-0-10 medallion[810]: File "/usr/local/lib/python3.8/dist-packages/medallion/backends/mongodb_backend.py", line 117, in _get_object_manifest
May 16 11:31:44 ip-100-64-0-10 medallion[810]: next_id, more = self._update_record(next_id, count, internal)
May 16 11:31:44 ip-100-64-0-10 medallion[810]: File "/usr/local/lib/python3.8/dist-packages/medallion/backends/mongodb_backend.py", line 74, in _update_record
May 16 11:31:44 ip-100-64-0-10 medallion[810]: if self.pages[next_id]["skip"] >= count:
May 16 11:31:44 ip-100-64-0-10 medallion[810]: KeyError: '2721dba1-29c6-4372-bffc-6469fc974051'

This is a guess, but it looks related to paging, and I have seen paging errors caused by slowness combined with a small session timeout. If the session timeout (the session_timeout backend setting) is small, for example, and the Mongo querying takes too long, the paging record can disappear while the query is still running. The code isn't written to account for paging records disappearing at awkward times. An easy test would be to increase the timeout and see if the error goes away or changes.
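
For reference, here is a minimal sketch of where that setting might live, shown as a Python dict equivalent to the JSON config file. The module path, URI, and timeout value are illustrative assumptions, not values taken from this issue:

# Sketch only: the layout of the "backend" section is an assumption based on a
# typical medallion config; session_timeout is the setting of interest here.
config = {
    "backend": {
        "module": "medallion.backends.mongodb_backend",  # module path per the traceback above
        "module_class": "MongoBackend",
        "uri": "mongodb://localhost:27017/",  # placeholder URI
        "session_timeout": 1200,  # illustrative value only
    },
}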

Regarding general slowness, I think the mongo backend is inefficient. One misconception people have is that if no filters are given, no filtering should happen and queries should be simple and fast. But that's not how TAXII works. TAXII requires that certain default filtering happen, so you can improve performance by adding filters which effectively disable the default filtering. Default filters are defined for version and spec_version (both of which include only the latest versions). You can speed things up by using match[version]=all and match[spec_version]=2.1 (or whatever your desired spec version(s) are). The former lets the server skip figuring out the latest version of each object, and the latter allows a similar skip for spec_version. You might try it and see if it helps.
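
For example, here is a minimal sketch of a raw request with both filters applied. The server URL and credentials are placeholders, the collection ID is the one from the log above, and the match[...] parameters are standard TAXII 2.1 query parameters:

import requests

BASE = "https://taxii.example.com/trustgroup1"  # placeholder API root URL
COLLECTION = "64993447-4d7e-4f70-b94d-d7f33742ee63"

resp = requests.get(
    f"{BASE}/collections/{COLLECTION}/objects/",
    params={
        "limit": 5,
        "match[version]": "all",       # skip the "latest version per object" work
        "match[spec_version]": "2.1",  # skip spec_version resolution
    },
    headers={"Accept": "application/taxii+json;version=2.1"},
    auth=("user", "password"),         # placeholder credentials
)
resp.raise_for_status()
envelope = resp.json()
print(len(envelope.get("objects", [])), "objects; more =", envelope.get("more"))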

@shubhamcyble
Author

shubhamcyble commented May 18, 2022

@rpiazza Indexing seems like a good option. Let me try it and get back to you with the results.

@chisholm I'd disagree on the inefficiency of MongoDB. It can handle millions of records with and without filters; MongoDB takes it quite smoothly. Something is wrong with the query that has been built for it. I'm using spec_version 2.1, and the version is the same across the entire DB, so adding those filters will make no difference.

One more observation: after a STIX object is pushed to a MongoDB collection, open the collection and check the type of _manifest.date_added. It should be of type date or timestamp, but it is inserted as a double. Not sure, but it could be one reason for the slowness.

If you plan to change the DB, a TSDB can be a better option for data that has to be queried by date range. An Elasticsearch implementation could also make it very fast.

@chisholm
Contributor

@chisholm I'd disagree on the inefficiency of MongoDB. It can handle millions of records with and without filters; MongoDB takes it quite smoothly. Something is wrong with the query that has been built for it.

That's not what I meant. The medallion architecture separates the web part (using Flask), which handles HTTP requests and responses, from a "backend", which handles the data. Two backends are included: one for Mongo (which interfaces with MongoDB) and an in-memory backend (which doesn't use any database). I meant the mongo backend is inefficient.

I'm using spec_version 2.1, and the version is the same across the entire DB, so adding those filters will make no difference.

If you don't use the "all" version filter, the mongo backend will still go through the process of finding the most recent versions of all the objects. That adds overhead; a sketch of the kind of query this implies follows below.
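
To illustrate the kind of work that default implies, here is a rough sketch of the general shape of a "latest version per id" query over the objects collection. This is not medallion's actual pipeline; the URI and exact field usage are assumptions:

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017/")  # placeholder URI
objects = client["trustgroup1"]["objects"]

# With the default "last" version filter, the server must work out the most
# recent version of every object id before it can page results.  A grouping
# pass like this touches every matching document:
latest_per_id = objects.aggregate([
    {"$match": {"_collection_id": "64993447-4d7e-4f70-b94d-d7f33742ee63"}},
    {"$sort": {"id": 1, "_manifest.version": 1}},
    {"$group": {"_id": "$id", "latest_version": {"$last": "$_manifest.version"}}},
])

# With match[version]=all, no per-id grouping is needed at all.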

One more observation: after a STIX object is pushed to a MongoDB collection, open the collection and check the type of _manifest.date_added. It should be of type date or timestamp, but it is inserted as a double.

As far as I understand, that was done intentionally because the built-in timestamp type doesn't support microsecond precision.
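
For what it's worth, a small sketch of why a plain float works where BSON's native Date does not (Date keeps only millisecond precision, so microseconds would be lost):

import datetime

ts = datetime.datetime(2022, 5, 16, 11, 31, 44, 123456, tzinfo=datetime.timezone.utc)

# Stored as a double (seconds since the epoch), the microseconds survive;
# a BSON Date would truncate .123456 to .123.
as_float = ts.timestamp()
recovered = datetime.datetime.fromtimestamp(as_float, tz=datetime.timezone.utc)
print(as_float, recovered.microsecond)  # microsecond component is preserved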

@shubhamcyble
Author

shubhamcyble commented May 18, 2022

@rpiazza I tried adding the indexes to MongoDB. After that, I again tried to poll the data with limits as well as filters.
Here's the stack trace:

Traceback (most recent call last):
File "/<VENV_PATH>/python3.8/site-packages/flask/app.py", line 2095, in __call__
return self.wsgi_app(environ, start_response)
File "/<VENV_PATH>/python3.8/site-packages/flask/app.py", line 2080, in wsgi_app
response = self.handle_exception(e)
File "/<VENV_PATH>/python3.8/site-packages/flask/app.py", line 2077, in wsgi_app
response = self.full_dispatch_request()
File "/<VENV_PATH>/python3.8/site-packages/flask/app.py", line 1525, in full_dispatch_request
rv = self.handle_user_exception(e)
File "/<VENV_PATH>/python3.8/site-packages/flask/app.py", line 1523, in full_dispatch_request
rv = self.dispatch_request()
File "/<VENV_PATH>/python3.8/site-packages/flask/app.py", line 1509, in dispatch_request
return self.ensure_sync(self.view_functions[rule.endpoint])(**req.view_args)
File "/<VENV_PATH>/python3.8/site-packages/flask_httpauth.py", line 172, in decorated
return self.ensure_sync(f)(*args, **kwargs)
File "/<VENV_PATH>/cti-taxii-server/medallion/views/objects.py", line 115, in get_or_add_objects
objects, headers = current_app.medallion_backend.get_objects(
File "/<VENV_PATH>/cti-taxii-server/medallion/backends/mongodb_backend.py", line 30, in api_wrapper
return func(*args, **kwargs)
File "/<VENV_PATH>/cti-taxii-server/medallion/backends/mongodb_backend.py", line 260, in get_objects
count, objects_found = full_filter.process_filter(
File "/<VENV_PATH>/cti-taxii-server/medallion/filters/mongodb_filter.py", line 119, in process_filter
count = self.get_result_count(pipeline, data)
File "/<VENV_PATH>/cti-taxii-server/medallion/filters/mongodb_filter.py", line 138, in get_result_count
count_result = list(data.aggregate(count_pipeline))
File "/<VENV_PATH>/python3.8/site-packages/pymongo/collection.py", line 2402, in aggregate
return self._aggregate(
File "/<VENV_PATH>/python3.8/site-packages/pymongo/collection.py", line 2309, in _aggregate
return self.__database.client._retryable_read(
File "/<VENV_PATH>/python3.8/site-packages/pymongo/mongo_client.py", line 1371, in _retryable_read
return func(session, server, sock_info, read_pref)
File "/<VENV_PATH>/python3.8/site-packages/pymongo/aggregation.py", line 140, in get_cursor
result = sock_info.command(
File "/<VENV_PATH>/python3.8/site-packages/pymongo/pool.py", line 770, in command
self._raise_connection_failure(error)
File "/<VENV_PATH>/python3.8/site-packages/pymongo/pool.py", line 743, in command
return command(
File "/<VENV_PATH>/python3.8/site-packages/pymongo/network.py", line 134, in command
message._raise_document_too_large(name, size, max_bson_size + message._COMMAND_OVERHEAD)
File "/<VENV_PATH>/python3.8/site-packages/pymongo/message.py", line 1015, in _raise_document_too_large
raise DocumentTooLarge("%r command document too large" % (operation,))
pymongo.errors.DocumentTooLarge: 'aggregate' command document too large
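
For context: that DocumentTooLarge error means the aggregate command document itself (the pipeline built by the filter code, not the result set) exceeded MongoDB's BSON command size limit. A rough way to check how big a pipeline is before sending it; the $in stage below is purely a hypothetical stand-in for whatever medallion actually built:

import bson  # ships with pymongo

# Hypothetical pipeline: a very large $in list is one way a command document
# grows.  Substitute the real pipeline to measure it.
pipeline = [{"$match": {"id": {"$in": ["indicator--0000"] * 200000}}}]

command_doc = {"aggregate": "objects", "pipeline": pipeline, "cursor": {}}
print(len(bson.encode(command_doc)), "bytes; limit is", 16 * 1024 * 1024)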

@slunak

slunak commented May 18, 2022

@rpiazza thanks for the reply.

I used the initialization script from this repo's test data, which already adds indexes to the DB. After checking, the indexes are indeed already there:

rs0:PRIMARY> db.objects.getIndexes()
[

{
	"v" : 4,
	"key" : {
		"_id" : 1
	},
	"name" : "_id_",
	"ns" : "trustgroup1.objects"
},
{
	"v" : 4,
	"key" : {
		"id" : 1
	},
	"name" : "id_1",
	"ns" : "trustgroup1.objects"
},
{
	"v" : 4,
	"key" : {
		"type" : 1
	},
	"name" : "type_1",
	"ns" : "trustgroup1.objects"
},
{
	"v" : 4,
	"key" : {
		"_manifest.date_added" : 1
	},
	"name" : "_manifest.date_added_1",
	"ns" : "trustgroup1.objects"
},
{
	"v" : 4,
	"key" : {
		"_manifest.version" : 1
	},
	"name" : "_manifest.version_1",
	"ns" : "trustgroup1.objects"
},
{
	"v" : 4,
	"key" : {
		"_collection_id" : 1
	},
	"name" : "_collection_id_1",
	"ns" : "trustgroup1.objects"
},
{
	"v" : 4,
	"key" : {
		"_manifest.media_type" : 1,
		"_manifest.date_added" : 1
	},
	"name" : "_manifest.media_type_1__manifest.date_added_1",
	"ns" : "trustgroup1.objects"
},
{
	"v" : 4,
	"key" : {
		"_manifest.media_type" : 1,
		"_manifest.version" : 1
	},
	"name" : "_manifest.media_type_1__manifest.version_1",
	"ns" : "trustgroup1.objects"
},
{
	"v" : 4,
	"key" : {
		"_collection_id" : 1,
		"_manifest.date_added" : 1
	},
	"name" : "_collection_id_1__manifest.date_added_1",
	"ns" : "trustgroup1.objects"
}

]
rs0:PRIMARY>

@slunak

slunak commented May 18, 2022

Regarding general slowness, I think the mongo backend is inefficient. One misconception people have is that if no filters are given, no filtering should happen and queries should be simple and fast. But that's not how TAXII works. TAXII requires that certain default filtering happen, so you can improve performance by adding filters which effectively disable the default filtering. Default filters are defined for version and spec_version (both of which include only the latest versions). You can speed things up by using match[version]=all and match[spec_version]=2.1 (or whatever your desired spec version(s) are). The former lets the server skip figuring out the latest version of each object, and the latter allows a similar skip for spec_version. You might try it and see if it helps.

Interesting thought, I will test it with our client. However, if this is the root cause of the issue, a fix should be applied to the server, because normally there is no control over clients that do not use these match filters.

UPDATE:

Using both the match[version]=all and match[spec_version]=2.1 filters indeed helps. However, as said above, there is no control over clients if they choose not to use these filters. Therefore, the fix should be applied on the server.

@shubhamcyble
Author

@chisholm / @rpiazza Any update on when a fix for this issue might be planned for development?

@rpiazza
Contributor

rpiazza commented May 20, 2022

As of a few weeks ago, we are in the process of transferring maintenance of all of these OASIS Open STIX/TAXII utilities to others. We still monitor these repositories, and would be happy to contribute a fix for a minor bug, but this appears to be a much more involved change. One of the issues is that this is a reference implementation of the TAXII server - a proof of concept - and not intended to be used in production. The new maintainers might be able to do more (@ejratl can you respond?). Also, there are other open-source TAXII server implementations available, which may be more robust.

@ejratl

ejratl commented May 20, 2022

@rpiazza Unfortunately, I am not one of the new maintainers of TAXII, perhaps @sparrell can answer instead. (We are also still in need of Co-Chairs for the TAXII WG, in case anyone is interested.)

@sparrell

sparrell commented May 22, 2022 via email

@sparrell

sparrell commented May 22, 2022 via email

@sparrell

sparrell commented May 22, 2022 via email

@rpiazza
Contributor

rpiazza commented May 23, 2022

Hi @sparrell,
You are correct: MITRE is no longer the maintainer of any of the projects in the OASIS Open repositories, although the READMEs are a little "behind the times". That doesn't mean we won't comment on issues, or even fix a small bug, as any other community member can.

"Maintainers" doesn't really have a definition in this context. For the most part, it just means one has permission to merge a PR. Our sponsor supported MITRE in developing and fixing these prototypes, but feels that the community should take over the responsibility from now on. As with most things, it is hard to get people to do something they aren't paid to do.

That being said, the way I understand it, you have volunteered for exactly what you stated above: governance of PRs and evaluating whether PRs meet our guidelines. If you want to take it further than that, it is entirely up to you.

@Halk0

Halk0 commented Jan 31, 2023

Commenting here too, since I have a feeling my pull request #174 might resolve this one as well.

I already raised another issue, since I encountered a hard limit that prevented accessing any objects at all once the collection grew too big, so I felt it deserved its own issue. I rewrote a bit of the filtering logic of the MongoDB backend. These changes increased performance quite a lot. Paging is also now handled in a way that should not result in 502 errors from the MongoDB filtering. I have a feeling this was happening for the same reason as the bug mentioned in issue #173.

Hoping some of the maintainers get notified this way and the review process can proceed, or, if this is still relevant, that people in this issue can try out the changes in that pull request to see if it solves their problems.
