Skip to content
This repository has been archived by the owner on Jul 16, 2020. It is now read-only.

Weekly Meeting 2016 12 07

Kristen Carlson Accardi edited this page Dec 8, 2016 · 1 revision

Agenda

##Minutes

  • obedmr has started a project for his container based ciao development environment https://github.com/obedmr/docker-ciao
  • We are behind on our bug triage and never ever get to the scrub section of the agenda
    • it was agreed that the project lead (kristenc) would become the bug triager and triage offline, freeing our meetings to discuss other things.
  • We will have no meetings for the rest of the year due to holidays.

###Full IRC Log

<kristenc> ok meeting is started.
<kristenc> #topic Role call
<jvillalo> o/
* anunez9 (86868b46@gateway/web/freenode/ip.134.134.139.70) has joined
<tpepper1> o/
<btwarden> o/
<mcastelino> o/
<markusry> o/
<rbradford> o/
<mrkz> o/
<kristenc> #topic Opens
<kristenc> anyone with opens please chime in.
* p4tux has quit (Remote host closed the connection)
<obedmr> a quick one, if someone wants to take a look on ciao on top of docker containers, I'm creating this      initial project https://github.com/obedmr/docker-ciao
<kristenc> obedmr, this is your developer mode setup?
<obedmr> kristenc: yes
<kristenc> ok, we have gotten behind in our bug triaging.
<kristenc> here's this week's issues.
<kristenc> https://github.com/01org/ciao/issues?utf8=%E2%9C%93&q=is%3Aissue%20is%3Aopen%20created%3A%3E%3D2016-12-01
<kristenc> we can do last weeks if we finish these.
* pixelgeek (cnorman@nat/intel/x-xbcbiwzvzngngycj) has joined
<kristenc> #910 - need bat tests for external IPs. I was thinking this was a p2. I'm not sure how to do this outside of singlevm.
<kristenc> it is tested currently in verify.sh
<tpepper1> it'd be useful to figure out, since that will inform us how an operator would use the feature in the real world
<markusry> So we couldn't run them on the release cluster
<markusry> right now
<markusry> ?
<mcastelino> kristenc, we should  be able to test with BAT on single VM
<kristenc> you need to have control over the pool I think.
<mcastelino> but if we run on a real cluster... there will be cluster specific setup for routing
<kristenc> mcastelino, so - we could potentially add them in, but disable them
<tpepper1> to test in single VM, what IP pool do you use?
<mcastelino> we use a virtual pool... we do that today
<mcastelino> and setup routes for that pool
<mcastelino> that pool is contained to be within the machine
<mcastelino> as the machine hosts the whole cluster
<tpepper1> i see
<kristenc> if our release cluster were moved to a private network, then we could add the tests.
<kristenc> and document how to do it for others.
<kristenc> we could as a first step add the tests and only run in singlevm.
* piet ([email protected]) has joined
<kristenc> ok if nobody objects, P2?
<kristenc> #912
<kristenc> this is true - but my feeling is it's a p3 for now.
<mcastelino> tpepper1, for reference      https://github.com/01org/ciao/blob/master/testutil/singlevm/verify.sh#L258
<mcastelino> this is how we setup the pool routes in single VM
<tpepper1> makes sense
<kristenc> #916 - I agree, I hate the error handling in ciao-cli.
<markusry> agree.
<kristenc> the stack trace useful if you are a programmer.
<markusry> It's not even that useful
<markusry> as ciao-cli is so small
<tpepper1> in this case I don't think it's useful
<tpepper1> the printf of the error says what's wrong.  the trace doesn't contribute
<markusry> each command corresponds to a function
<btwarden> The bigger issue is that the underlying service is returning 500 in that case, so there's not enough information for ciao-cli to formulate a reasonable response, right?
<markusry> so it's easy to figure out where the problem must be
<kristenc> well - you could very easily change the code to dump a stack trace with -v=2 or something, but otherwise just do nicer errors.
<tpepper1> btwarden: yes.  500 should lead to a message like "instance does not exist"
<kristenc> btwarden, our error response codes from the compute api leads much to be desired.
<kristenc> however, cli can respond more gracefully.
<markusry> I don't think the stack trace is useful in ciao-cli
<markusry> If you got the stack trace of controller that would be interesting
<btwarden> Seems like it should either be a 400- or 200- response in that case.
<markusry> But you don't get that
<mcastelino> markusry, we do concatenate errors when we generate errors...
<rbradford> we can shove a message into the 500 code
<kristenc> a not found should be a 400.
<rbradford> or whatever code
<kristenc> 40x
<kristenc> it's a separate bug though.
<btwarden> 404 not found, 410 gone...
<tpepper1> is 500 the not auth'd?
<kristenc> the error response should already contain an error code.
<kristenc> tpepper1, no 500 is the generic server error code.
<rbradford> tpepper1, "Internal service error"
<mcastelino> markusry, can we report the concatenated error as an event
<kristenc> i think the instance code in cli doesn't report the error message right either.
<mcastelino> is there a way to co-relate the API call to event
<btwarden> 500 usually suggests there's a problem internal to the underlying service
<tpepper1> hm.  so does the compute api return 500 when the instance is not found?
<kristenc> i'd have to check.
<rbradford> https://golang.org/pkg/net/http/#Error
<rbradford> we should shove our error message into the message
<tpepper1> well.  in this case I'm less concerned with the specific error than the ugly trace
<kristenc> rbradford, we do already.
<tpepper1> there may be 29 other issues in error handling besides the trace
<markusry> There are multiple issues here
<rbradford> kristenc, oh, i've seen some places where that's not the case, but in this case I see that it is there.
<kristenc> so - this issue specifically is about how unhelpful the stack trace is.
<tpepper1> yes that was my intent
<markusry> 916 is about the traces output from ciao-cli that don't have any useful purpose
<markusry> Right
<kristenc> rbradford, I noticed recently there's a bug in cli where they don't print the returned error message.
<tpepper1> i retitled the issues
<tpepper1> issue
* kristenc forgot to file though
<kristenc> thanks - so for this issue, I would vote for P2 or P3.
<markusry> At least a P2 for me
<markusry> It's quite a bad user experience
<kristenc> ok - P2.
<kristenc> #917
<kristenc> singlevm runs all services as root.
<kristenc> we used to be able to run most ciao services without root.
<kristenc> then we had to change this - it was because of needing to access /var, right?
<mcastelino> kristenc, we need root for networking netlink calls and qemu
<mcastelino> the qemu issue is easier to solve
<kristenc> for launcher.
<markusry> kristenc: we could make it work by just chown /var/lib/ciao
<kristenc> ?
<markusry> in single VM
<markusry> setup.sh I mean
<markusry> I entered a separate bug for launcher
<markusry> So for me we should be able to fix this by creating /var/lib/ciao with the correct permissions
<tpepper1> isn't there an older issue for having ciao components run as a ciao user/group with them owning /var/lib/ciao?
<markusry> and then running controller and scheduler as non-root
<btwarden> Or can we even create it under ~/local instead?
<markusry> That's the other option
<tpepper1> doing /var/lib/ciao right is good for real usage and singlevm
<tpepper1> I'd prefer that route
<tpepper1> otherwise you "fix" just singlevm
<markusry> controller now allows you to select your prefix
<kristenc> it's be nice to fix the real install to not need to run the services as root either.
<markusry> We could do this for the other components
<tpepper1> also if the perms are wrong and glog fails to open the specified file it falls back to /tmp
<tpepper1> which might constitute a leak of info depening on the +r mode bits there
* p4tux ([email protected]) has joined
<kristenc> ok - so for priority - p2?
<markusry> okay for me
<kristenc> #918 then would be the same priority?
<markusry> Sounds good to me
<kristenc> #920 - I don't think we'll be moving away from sqlite3 in the near term (next 3 months).
<kristenc> rbradford, what is the current code coverage for sqlite code?
<tpepper1> 72 ish
<rbradford> yup
<kristenc> I think it's at a sufficient level then to not prioritize this highly, as long as any new code comes with unit tests.
<rbradford> there is one area that has no coverage
<rbradford> which is instances
<rbradford> i mean, i don't think they're an important part....
<rbradford> :-)
<kristenc> heh. well- the good news on that is we do get a lot of code coverage with bat tests on that.
<kristenc> so a breakage would likely be caught.
<rbradford> sure
<kristenc> i think we should just be careful if we touch any of the instances related code in sqlite3 to add a unit test to make sure we don't regress.
<kristenc> so I am thinking of a p3 for this.
<tpepper1> I'd label janitorial and p2 it
<tpepper1> would be a nice area to bulk up coverage and get the nice line by line views the tools give you
<kristenc> we'd need to establish a criteria for closing it.
<tpepper1> could guide / prioritize further additions or pragmatic skips
<kristenc> 100% coverage probably isn't realistic.
<tpepper1> if the issue says pragmatically add coverage for key mutations of instances, I'm happy
<tpepper1> not sure if you consider that "criteria"
<kristenc> ok - let's make this one about adding unit tests specifically for instances?
<kristenc> well - we can leave it, I don't want to spend our time on it now.
<kristenc> #921 
* Patifa ([email protected]) has joined
<kristenc> this issue is pretty broad, but I would make it a p3 janitorial.
<kristenc> parts should probably be broken out into separate issues when someone has the bandwidth to work on it.
<rbradford> good plan
* Patifa has quit (Client Quit)
<kristenc> #924
<kristenc> albertom, would you say this is a p1 bug for you?
<obedmr> I'd say yes
<albertom> its a dup
<albertom> of
<albertom> https://github.com/01org/ciao/issues/943
<albertom> which already has a fix
<kristenc> ok - we'll leave it as a p1, when you confirm the fix for 943 fixes this problem you can close it.
<albertom> ok
<kristenc> #928 - markusry can we close this? I've not had any problems lately.
<kristenc> although sadly I can't just go check the status and see. I will tomorrow.
<markusry> Yes, let's close it
<kristenc> ok.
<markusry> I'll re-open it if it re-appears
<kristenc> #929
<kristenc> this is a continuation of a feature.
<mrkz> obedmr: ^
<obedmr> yesp
<obedmr> yes, next sprint task
<kristenc> yes - that is what I was thinking.
<kristenc> so I would say it's a p1, but a feature not a bug, and we'll schedule it into the next sprint.
<obedmr> kristenc: sure
<obedmr> agree
<kristenc> #940 this is a dup of one we used to have that we closed because we all agreed storage should be bat tests only.
* mrkz has to run
<tpepper1> I  must've forgotten we agreed that
<kristenc> ok - sorry, we are running long. let's wrap up #940 then end this meeting.
<markusry> I try to figure out if I can compute the coverage of the BAT tests
* anunez9 has quit (Quit: Page closed)
<markusry> I'll I mean
<tpepper1> if we can figure that I bet we get a nice boost in computed coverage
* p4tux has quit (Remote host closed the connection)
<tpepper1> versus the coveralls reported number
<kristenc> so - do we need to revist the decision to have ciao-storage be unit tested? or should we close #940 wontfix?
<rbradford> tpepper1 adds some unit tests in his PR
* p4tux ([email protected]) has joined
<tpepper1> I suppose things that require a server to function are either bat-only on need testutil fakery
<tpepper1> the noop driver is trivially easy to move to 100% coverage :)
<kristenc> yeah - that was why we opted for bat only.
<kristenc> I stand by the original decision :). it would be nice to confirm that our test coverage is good with bat though.
<tpepper1> the only issue with that is we don't get a gate on it
<tpepper1> add broken code, don't test it in bat, nothing notices
<kristenc> something notices eventually. release would fail.
<kristenc> the storage bats are run nightly.
<tpepper1> that insures we don't regress for things covered by bat
<tpepper1> we don't have a gate to insure bat coverage isn't small and shrinking
<kristenc> that part is true - if mark can get coveralls to report on our bat coverage then code coverage % can be enforced with travis.
<tpepper1> yes
<tpepper1> so close it and we'll see where we gert
<tpepper1> get
<kristenc> ok.
<kristenc> 3 things before we quit.
<kristenc> 1) no meetings for dec 22,29
<kristenc> will most of the team be here next week?
<kristenc> or should we cancel that too?
* p4tux has quit (Ping timeout: 260 seconds)
<kristenc> also - we are really not keeping up with the bug triage.
<kristenc> I think we need to schedule a special meeting to get caught up.
<tpepper1> I'm here next week.  and am up for an extra triage meeting
<kristenc> ok - perhaps next week we could go 2 hours?
<rbradford> no mark and I
<kristenc> ah.
<kristenc> that's going to make it hard.
<kristenc> so - how about in Jan we dedicate a chunk of time to get caught up.
<kristenc> there will likely be weeks worth of issues to triage anyway due to the no meetings.
<kristenc> i wonder also if this is sustainable long term.
<kristenc> to try to have us triage as a group.
<kristenc> it was easy when there were a few issues per week.
* p4tux ([email protected]) has joined
<tpepper1> it would be nice to have time to discuss architecture and not just bugs
<kristenc> it becomes harder and harder with more issues piling up. it's be simpler to have a single person triage all the stuff.
<tpepper1> i think we need two meetings
<tpepper1> conversation around these things is important
<tpepper1> but we could also have a bug master like we have a merge master
<kristenc> personally, I'd rather just delegate bug triage to a person.
<kristenc> and let people complain later if they disagree.
<kristenc> i like the idea of a bug master.
<kristenc> does anyone think this is a bad idea?
<tpepper1> some portion of the bug triage is prioritizing work assignments.  should the team lead always be one of the bug masters?
<kristenc> yes - I can do it by myself for now. this is fine and makes sense to me if everyone is ok with it.
<kristenc> people will need to review the priorities at some point though to make sure I'm not off.
<tpepper1> would that happen if we did buddies like the merging? lead and a rotating buddy do triages?
<tpepper1> still implies a meeting if there's a 2nd person
<kristenc> tpepper1, yes - that is my hestitation. I want to make this easier and easier means less needing to coordinate with everyone.
<kristenc> if I did it myself, I could do it in 5 minutes a day.
<tpepper1> then let's stop talking and get doing :)
<kristenc> thanks. I am assuming silence from the rest of the crew means agreement.
<kristenc> so I will make it so.
<kristenc> that's actually very exciting. our meetings will now become utterly different.
<rbradford> cool
<kristenc> so next week - should we cancel since UK is out?
* Teresita-Warrior has quit (Quit: Leaving.)
<kristenc> I think we should.
<rbradford> i would recommend that
<kristenc> so unless people object, let's cancel next week as well. I'll update the wiki.
<kristenc> ok, this meeting is over.
<kristenc> I'll have to cut and paste the logs into the minutes later. sorry for the confusion with meetingbot, I wish I'd noticed it in time to fix it.
Clone this wiki locally