Bug: Scheduler accepts offer for insufficient disk resources, fails to detect CREATE failure #418

Open
dylanwilder opened this issue Mar 16, 2017 · 2 comments

@dylanwilder
Contributor

dylanwilder commented Mar 16, 2017

There seem to be two separate issues here. The Mesos agent is configured with ~15 GB of ROOT disk, while Cassandra is looking for 20 GB. See the offer here:

INFO  [2017-03-14 20:42:39,688] com.mesosphere.dcos.cassandra.scheduler.CassandraScheduler: Received Offer: id { value: "ec7ab46c-8786-43aa-8bab-148ea8f9a872-O36443" } framework_id { value: "ec7ab46c-8786-43aa-8bab-148ea8f9a872-0002" } slave_id { value: "cf0e92c8-2784-4613-a873-5c936a02eb70-S50" } hostname: "10.X.X.50" resources { name: "cpus" type: SCALAR scalar { value: 32.8 } role: "*" } resources { name: "mem" type: SCALAR scalar { value: 245176.0 } role: "*" } resources { name: "disk" type: SCALAR scalar { value: 15275.0 } role: "*" } resources { name: "disk" type: SCALAR scalar { value: 675867.0 } role: "*" disk { source { type: PATH path { root: "/mnt/data1" } } } } resources { name: "disk" type: SCALAR scalar { value: 675867.0 } role: "*" disk { source { type: MOUNT mount { root: "/mnt/data2" } } } } resources { name: "disk" type: SCALAR scalar { value: 675867.0 } role: "*" disk { source { type: PATH path { root: "/mnt/data3" } } } } resources { name: "disk" type: SCALAR scalar { value: 675867.0 } role: "*" disk { source { type: MOUNT mount { root: "/mnt/data4" } } } } resources { name: "disk" type: SCALAR scalar { value: 675867.0 } role: "*" disk { source { type: PATH path { root: "/mnt/data5" } } } } resources { name: "disk" type: SCALAR scalar { value: 675867.0 } role: "*" disk { source { type: MOUNT mount { root: "/mnt/data6" } } } } resources { name: "ports" type: RANGES ranges { range { begin: 31000 end: 31028 } range { begin: 31030 end: 31295 } range { begin: 31298 end: 32305 } range { begin: 32307 end: 33000 } } role: "*" } attributes { name: "nfs" type: TEXT text { value: "group1" } } attributes { name: "dnsHostname" type: TEXT text { value: "dny1-bvlt-r1n11" } } attributes { name: "rack" type: TEXT text { value: "r1" } } attributes { name: "diskType" type: TEXT text { value: "SSD" } } attributes { name: "ipAddress" type: TEXT text { value: "10.X.X.50" } } url { scheme: "http" address { hostname: "10.X.X.50" ip: "10.X.X.50" port: 5051 } path: "/slave(1)" }
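
To make the mismatch concrete, here is a minimal Python sketch (not the scheduler's actual code) that separates the disk resources in the offer above by source type and checks the ROOT disk against the 20480 MB requirement. The tuple layout and the `root_disk_mb` and `offer_satisfies` helpers are assumptions made for illustration; only the numbers come from the log.

```python
# Disk resources from the offer above, as (value_mb, source_type) pairs.
# A source_type of None stands for ROOT disk, i.e. a disk resource that
# carries no `disk { source { ... } }` block in the offer.
OFFERED_DISKS = [
    (15275.0, None),
    (675867.0, "PATH"),   # /mnt/data1
    (675867.0, "MOUNT"),  # /mnt/data2
    (675867.0, "PATH"),   # /mnt/data3
    (675867.0, "MOUNT"),  # /mnt/data4
    (675867.0, "PATH"),   # /mnt/data5
    (675867.0, "MOUNT"),  # /mnt/data6
]

# Disk size from the scheduler's resource requirement in the logs.
REQUIRED_ROOT_DISK_MB = 20480.0


def root_disk_mb(disks):
    """Sum only the disk resources that have no source (ROOT disk)."""
    return sum(value for value, source in disks if source is None)


def offer_satisfies(disks, required_mb):
    """Return True only if ROOT disk alone covers the requirement."""
    return root_disk_mb(disks) >= required_mb


if __name__ == "__main__":
    print(root_disk_mb(OFFERED_DISKS))
    print(offer_satisfies(OFFERED_DISKS, REQUIRED_ROOT_DISK_MB))
```

Under these assumptions the ROOT disk in the offer is smaller than the requirement, so an evaluator that only counts sourceless disk would be expected to decline it.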

Cassandra decides to accept this insufficient offer:

INFO  [2017-03-14 20:42:39,776] org.apache.mesos.offer.MesosResourcePool: Retrieving resource for reservation
INFO  [2017-03-14 20:42:39,776] org.apache.mesos.offer.OfferEvaluator: Satisfying resource requirement: name: "disk" type: SCALAR scalar { value: 20480.0 } role: "cassandra.storage" disk { persistence { id: "" principal: "cassandra.storage" } volume { container_path: "volume" mode: RW } } reservation { principal: "cassandra.storage" labels { labels { key: "resource_id" value: "" } } }
with resource: name: "disk" type: SCALAR scalar { value: 20480.0 } role: "*"
INFO  [2017-03-14 20:42:39,777] org.apache.mesos.offer.OfferEvaluator: Reserves Resource
INFO  [2017-03-14 20:42:39,777] org.apache.mesos.offer.OfferEvaluator: Creates Volume
INFO  [2017-03-14 20:42:39,778] org.apache.mesos.offer.OfferEvaluator: Fulfilled resource: name: "disk" type: SCALAR scalar { value: 20480.0 } role: "cassandra.storage" disk { persistence { id: "ec53c7c6-fe3d-4b16-8b14-cf98b5fa03e4" principal: "cassandra.storage" } volume { container_path: "volume" mode: RW } } reservation { principal: "cassandra.storage" labels { labels { key: "resource_id" value: "f686ff09-c058-4a0a-9d69-a9bf04111c7c" } } }
...
INFO  [2017-03-14 20:42:39,790] org.apache.mesos.offer.OfferAccepter: Performing Operation: type: RESERVE reserve { resources { name: "disk" type: SCALAR scalar { value: 20480.0 } role: "cassandra.storage" reservation { principal: "cassandra.storage" labels { labels { key: "resource_id" value: "f686ff09-c058-4a0a-9d69-a9bf04111c7c" } } } } }
...
INFO  [2017-03-14 20:42:39,793] org.apache.mesos.offer.OfferAccepter: Performing Operation: type: CREATE create { volumes { name: "disk" type: SCALAR scalar { value: 20480.0 } role: "cassandra.storage" disk { persistence { id: "ec53c7c6-fe3d-4b16-8b14-cf98b5fa03e4" principal: "cassandra.storage" } volume { container_path: "volume" mode: RW } } reservation { principal: "cassandra.storage" labels { labels { key: "resource_id" value: "f686ff09-c058-4a0a-9d69-a9bf04111c7c" } } } } }

But after launching, the scheduler receives a TASK_ERROR status update from the master:

INFO  [2017-03-14 20:42:39,869] INFO  [2017-03-14 20:42:39,869] com.mesosphere.dcos.cassandra.scheduler.CassandraScheduler: Received status update for taskId=node-0__dd62e043-d6a3-4fbb-8400-07c7af0da107 state=TASK_ERROR source=SOURCE_MASTER reason=REASON_TASK_INVALID message='Task uses more resources cpus(cassandra.storage, cassandra.storage, {resource_id: c9e73072-7792-4d9e-a5b1-6bfb269ec12c}):4; mem(cassandra.storage, cassandra.storage, {resource_id: fb6cdba9-246c-449f-a84a-2943e676ff08}):10240; disk(cassandra.storage, cassandra.storage, {resource_id: f686ff09-c058-4a0a-9d69-a9bf04111c7c})[ec53c7c6-fe3d-4b16-8b14-cf98b5fa03e4:volume]:20480; ports(cassandra.storage, cassandra.storage, {resource_id: 8ce72836-4b31-46ab-8c0a-cfbe92f17f31}):[31990-31994]; cpus(cassandra.storage, cassandra.storage, {resource_id: b5fc6e9b-b90b-40f9-a8a3-d20568d27b12}):0.1; mem(cassandra.storage, cassandra.storage, {resource_id: 16fce9a1-6652-4035-837f-d13acf3ee453}):768; ports(cassandra.storage, cassandra.storage, {resource_id: 2919035c-32e2-4233-9e3b-b8373b711d12}):[31995-31995] than available cpus(*):27.7; mem(*):233912; disk(*):15275; disk(*)[]:675867; disk(*)[]:675867; disk(*)[]:675867; disk(*)[]:675867; disk(*)[]:675867; disk(*)[]:675867; ports(*):[31000-31028, 31030-31295, 31298-31989, 31996-32305, 32307-33000]; cpus(cassandra.storage, cassandra.storage, {resource_id: b5fc6e9b-b90b-40f9-a8a3-d20568d27b12}):0.1; mem(cassandra.storage, cassandra.storage, {resource_id: 16fce9a1-6652-4035-837f-d13acf3ee453}):768; ports(cassandra.storage, cassandra.storage, {resource_id: 2919035c-32e2-4233-9e3b-b8373b711d12}):[31995-31995]; cpus(cassandra.storage, cassandra.storage, {resource_id: c9e73072-7792-4d9e-a5b1-6bfb269ec12c}):4; mem(cassandra.storage, cassandra.storage, {resource_id: fb6cdba9-246c-449f-a84a-2943e676ff08}):10240; ports(cassandra.storage, cassandra.storage, {resource_id: 8ce72836-4b31-46ab-8c0a-cfbe92f17f31}):[31990-31994]; cpus(cassandra.storage, cassandra.storage, 
{resource_id: 0cf86d71-ac9f-4242-bd3b-f4862bf91a12}):1; mem(cassandra.storage, cassandra.storage, {resource_id: 77c294be-0129-4c4b-bd9f-1269697f2c7b}):256'

And finally, on attempting to relaunch, it is unable to, as it cannot find the nonexistent persistence ID:

INFO  [2017-03-14 20:42:41,746] org.apache.mesos.offer.MesosResourcePool: Retrieving reserved resource
WARN  [2017-03-14 20:42:41,746] org.apache.mesos.offer.MesosResourcePool: Failed to find reserved resource: f686ff09-c058-4a0a-9d69-a9bf04111c7c, in available resources: [0cf86d71-ac9f-4242-bd3b-f4862bf91a12, 77c294be-0129-4c4b-bd9f-1269697f2c7b, 8ce72836-4b31-46ab-8c0a-cfbe92f17f31]
WARN  [2017-03-14 20:42:41,747] org.apache.mesos.offer.OfferEvaluator: Failed to satisfy resource requirement: name: "disk" type: SCALAR scalar { value: 20480.0 } role: "cassandra.storage" disk { persistence { id: "ec53c7c6-fe3d-4b16-8b14-cf98b5fa03e4" principal: "cassandra.storage" } volume { container_path: "volume" mode: RW } } reservation { principal: "cassandra.storage" labels { labels { key: "resource_id" value: "f686ff09-c058-4a0a-9d69-a9bf04111c7c" } } }

From the master logs:

Mar 14 20:42:39 dny1-bvlt-r1n16 mesos-master[7950]: E0314 20:42:39.860085  7998 master.cpp:1955] Dropping CREATE offer operation from framework ec7ab46c-8786-43aa-8bab-148ea8f9a872-0002 (cassandra-s4) at [email protected]:46523: Invalid CREATE Operation: Insufficient disk resources
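
The second issue can be sketched as follows. This is a hypothetical illustration in Python, not the scheduler's code: `TaskRecord`, `on_status_update`, and the field names are invented for the example. The idea is that a TASK_ERROR from SOURCE_MASTER with REASON_TASK_INVALID suggests the operations sent with the launch may not all have taken effect, so the recorded disk resource and persistence IDs should not be assumed to exist on relaunch.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TaskRecord:
    """Hypothetical per-task state a scheduler might persist."""
    task_id: str
    disk_resource_id: Optional[str] = None
    persistence_id: Optional[str] = None


def on_status_update(record, state, source, reason):
    """Forget the disk IDs when the master rejects the task as invalid.

    Returns True if the IDs were cleared, so that the next offer
    evaluation would issue a fresh RESERVE and CREATE instead of
    searching offers for a reservation that was never made.
    """
    rejected_by_master = (
        state == "TASK_ERROR"
        and source == "SOURCE_MASTER"
        and reason == "REASON_TASK_INVALID"
    )
    if rejected_by_master:
        record.disk_resource_id = None
        record.persistence_id = None
    return rejected_by_master


if __name__ == "__main__":
    # IDs taken from the logs above.
    record = TaskRecord(
        task_id="node-0__dd62e043-d6a3-4fbb-8400-07c7af0da107",
        disk_resource_id="f686ff09-c058-4a0a-9d69-a9bf04111c7c",
        persistence_id="ec53c7c6-fe3d-4b16-8b14-cf98b5fa03e4",
    )
    cleared = on_status_update(
        record, "TASK_ERROR", "SOURCE_MASTER", "REASON_TASK_INVALID"
    )
    print(cleared, record)
```

Clearing everything unconditionally is probably too blunt for a real fix: in the logs above the cpus, mem, and ports reservations from the same launch do show up in later offers, so a real implementation would likely need to reconcile each resource separately.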
@mrbrowning
Contributor

Hi Dylan, I tried to reproduce this with an analogous setup: agents offering 36GB of ROOT disk space, 40GB of MOUNT disk space, and with the Cassandra scheduler set up to expect 38GB of ROOT disk space (running scheduler version 1.0.25-3.0.10). I saw the expected behavior, which is that all incoming offers were rejected and no task launch or volume creation was attempted. Can you give some more details about your setup? DC/OS version, Cassandra version, scheduler configuration on launch?

@triclambert
Contributor

This repo is deprecated and will be archived in one week. Please see the latest version of Cassandra or DSE for DC/OS:

https://docs.mesosphere.com/service-docs/cassandra/
https://docs.mesosphere.com/service-docs/dse/ (enterprise-only)
