Domain Adaptations #17
Hi, there is unfortunately no established process for this currently, just a theoretical possibility that it could be done relatively painlessly. Some pioneers need to explore this further personally. :-) If you want to add a new corpus of knowledge to YodaQA, here are some steps I've recently explained in an email to someone: First, I recommend skimming over doc/UIMA and doc/HIGHLEVEL. Building of the main pipeline is in cz.brmlab.yodaqa.pipeline.YodaQA. It would also be good to know more about what kind of corpora you have. The easiest thing to do is to just import them in Solr. However, with enwiki, we rely heavily on the fact that it is a corpus of title-oriented documents.
If you have some linguistic models trained for your purposes, I think you should be able to specify them as resources to the DKPro annotators (or just replace the annotators with custom ones). I realize this is not a detailed technical guide - it's currently not something for which we'd have a walkthrough, but with some cooperation with others, we could turn it into such a thing. |
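Regarding the Solr import mentioned above, here is a minimal sketch of pushing a Solr-import XML file into a local Solr over HTTP. The collection name ("data"), port, and file name are assumptions, and YodaQA's own import tooling may differ:

```python
# Minimal sketch: POST a Solr-import XML file (<add><doc>...</doc></add>)
# to a local Solr collection and commit in the same request.
# Collection name "data" and the default port 8983 are assumptions.
import requests

SOLR_UPDATE = "http://localhost:8983/solr/data/update"

with open("corpus.xml", "rb") as f:
    r = requests.post(SOLR_UPDATE, data=f,
                      params={"commit": "true"},
                      headers={"Content-Type": "text/xml"})
r.raise_for_status()
print("import status:", r.status_code)
```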
I have a document-oriented corpus and a relational database (which I created), both about a specific domain. Now I want to replace Freebase and the wiki (all databases, if possible). So do we only change the databases, or do we also need to change the Analysis Engines or the CAS structure? |
In that case, it should be enough to just change the databases for starters. Let us know how it went! |
But will question analysis extract correctly for a specific domain? |
That's a good question. In principle, it will be doing something - but of course, it may not be perfect in recognizing named entities. Later, you can improve this using your own NER model. Also, entity linking will work better if you load up your list of entities to a lookup service. But I think it's best to do this gradually and at the beginning just swap the knowledge bases for answer production. Another thing that will help later is typing candidate answers. An example for recognizing biomedical terms (like protein names) using GeneOntology: 7a1389c |
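A minimal sketch of the kind of entity lookup meant here - loading a domain entity list and resolving mentions to canonical IDs. The file name and the "entity_id<TAB>label" format are hypothetical; YodaQA's actual label-lookup service additionally does fuzzy (typo-tolerant) matching:

```python
# In-memory entity lookup sketch: normalize surface forms so that near
# matches ("Gene Ontology", "gene-ontology") resolve to one canonical ID.
# The "entity_id<TAB>label" file format is a hypothetical example, not the
# format YodaQA's label-lookup service actually consumes.
import re
from collections import defaultdict

def normalize(label):
    return re.sub(r"[^a-z0-9]+", " ", label.lower()).strip()

def load_entities(path):
    index = defaultdict(set)
    with open(path, encoding="utf-8") as f:
        for line in f:
            entity_id, label = line.rstrip("\n").split("\t", 1)
            index[normalize(label)].add(entity_id)
    return index

def lookup(index, mention):
    return sorted(index.get(normalize(mention), ()))

idx = load_entities("domain-entities.tsv")  # hypothetical file
print(lookup(idx, "Gene Ontology"))
```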
Let me pick up this thread. I was the one Petr sent the explanation to. So far I have primarily been working on Watson (which turned out to be a good choice considering they will shut down the current example corpora for medicine and travel, as well as the corresponding QA API, in a month, so I got to play around with them) and my frontend, but I have also built a local minicluster with an octocore and two quadcore boxes with 24, 20 and 8GB of RAM respectively. They run YodaQA completely offline and I replaced the enwiki Solr DB with an example corpus [plaintext papers, not TODs (Title Oriented Documents)], which doesn't work at all 😁 So over the next weeks and especially early next year I will try to make it useful. If someone has additional ideas about what to do next, I'm all ears. Oh yeah, one more thing: creating the Freebase node has now been running for 29 days on an i7 quadcore with 24GB RAM. It still seems to progress steadily, though (1.25 billion DB entries so far). Should I worry? Everything else took more or less exactly the time specified in the READMEs, but this thing is just through the roof. Best wishes, |
Hi! Are you importing Freebase to Fuseki? Is it stuck on CPU or IO? Is the rate constant or slowing down? It's surprising that it's taking so long. Are you in the initial import phase or indexing phase? We may want to start another issue to discuss that, though. |
Update: Freebase setup is solved, cmp. #26 |
While I haven't worked on this for a while, I now have almost exactly 5000 papers collected (and converted) for "my" domain. With the REST backend functional [did you already have time to look at it?], my new knowledge of how JBT works, and 4 weeks of unassigned time until my deadline, I kinda ask myself how far I could get with the domain adaptation. I'm playing with a few ideas from previous discussions: a) Term extraction to replace the label matching. [I currently only have a glossary I wrote myself with 300 terms.] Could even do this with a UIMA pipeline. Do you think this would make sense (in this order)? Is this the correct bottom line and prioritization from our other discussions? PS: |
I'm not quite sure what you mean here, sorry.
So, for TOD, we have the strategy that takes the first sentence in .solrfull
Totally, I'm very curious about how this would work in practice. I'm not sure what the best way to organize things is. Right now, for
Does that make sense? OTOH we want (c) in master as well. (In the long run, cross-merging all the branches is kind of a pain, |
I'll need some time to think about everything else you said. But about (a): I thought that having some keyphrases should be helpful for QA, since for a new domain we don't know which words or multiwords describe relevant concepts. I then tried to write down a list of key concepts myself, got about 300 concepts and saw that this isn't enough. Hence, in order to acquire this vocabulary automatically, I wrote a keyphrase extraction: I convert my documents into decent plaintext strings, make sure they are English by applying some language detection, then extract keyphrases with one of several backends (currently AlchemyAPI or a local implementation), and then do some cleanup and majority voting to find out which keyphrases are relevant for the entire corpus. This way I get multiple megabytes' worth of keyphrases out of my 5000 documents. For instance, for e-learning papers I would get concepts like "distance learning", "double loop learning", "LaaN theory", "educational data mining" etc. That should be helpful for domain adaptation, right? My first idea was to put these concepts into a label service like the Fuzzy Label Lookup Service you use for DBpedia. But this might be naive, because I haven't really looked into what it does in detail yet. |
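The cleanup-and-voting step described here could look roughly like the following sketch; the cleanup rules and the document-frequency threshold are illustrative assumptions, not the actual implementation:

```python
# Majority voting over per-document keyphrase lists: keep only phrases
# that enough documents agree on. Thresholds and cleanup are assumptions.
from collections import Counter

def clean(phrase):
    p = phrase.strip().lower()
    if len(p) < 3 or p.isdigit():  # drop obvious noise
        return None
    return p

def vote(keyphrases_per_doc, min_doc_fraction=0.01):
    # keyphrases_per_doc: one list of extracted phrases per document.
    doc_freq = Counter()
    for doc_phrases in keyphrases_per_doc:
        cleaned = {clean(p) for p in doc_phrases}
        cleaned.discard(None)
        doc_freq.update(cleaned)  # count each phrase once per document
    cutoff = max(2, min_doc_fraction * len(keyphrases_per_doc))
    return sorted(p for p, n in doc_freq.items() if n >= cutoff)

# With 5000 documents and min_doc_fraction=0.01, a phrase such as
# "distance learning" survives if at least 50 papers yielded it.
```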
Oh, having the code to do this would be awesome - this sounds pretty neat! The step we perform in YodaQA right now which I might have omitted in the discussion above, and which is related to this, is Concept Linking; the way it's done and used is a bit tied to TOD as well:
So, when we don't have a TOD corpus, we should disable label-lookup by default! But if we get your mechanism, we wouldn't have the link, but we would still benefit from it due to better clues. How important it is is hard to say; it's probably not revolutionary, but it should certainly have some impact. I think the best way would be to reuse the label-lookup infrastructure (the fuzzy part, though if you have a way to detect synonymous references to the same concept, that'd belong to the crosswikis part), but provide a null link. Makes sense? I think it's not necessarily the top priority, though. But if you need a reason to get the code for what you described out there, I'd totally be for it. :-) |
Sure. Once I'm done with my paper in early April we'll have to sift through all my stuff and see what you want and how to integrate it. Also: Thanks for the extensive answers, they are very helpful and I highly appreciate it. |
I just got a dump from a wiki about my domain and I'm considering an attempt to put this into YodaQA. Transforming the MediaWiki dump XML format into Solr's XML format is fairly trivial (esp. since I've done that before with my papers) and, similarly, creating the static labels should be straightforward by generating them from the page titles. But: I just read the paper about the SQLite labels ("A Cross-Lingual Dictionary for English Wikipedia Concepts") and this seems way more involved. The authors even explicitly state in their summary that "[t]he dictionary [...] would be difficult to reconstruct in a university setting". Do you think it's worth a shot despite all this? This is neither essential for my paper nor do I really have any time left for it, but I would really like to have the capability of adding arbitrary wiki dumps to YodaQA. I mean, we finally have the distributional semantics, the keyphrase extraction, the converted corpora and the TODs - I would really like to see it all come together now. Any hints or remarks on how to tackle this? What should I do with the extracted keyphrases (which I could also expand with my distributional semantics model to e.g. get similar terms) - can I just throw them in the label service, or do anything clever with them without modifying the UIMA annotators [which I don't have time for until my deadline]? Thanks in advance. Best wishes, |
Hi! I think you are maybe setting your goals too high for your initial work on this. :) The fuzzy labels dataset is just there to make it more robust to different spellings, nicknames etc., and by default that's extracted from the Wikipedia corpus - but if you don't need that, it's fine to just include a mapping from concept name to its canonical article by id, without anything more involved. I think maybe the most elegant solution would be writing a script that takes the Solr-import XML and generates the labels dataset to load into label-lookup based on that. This should be pretty trivial, and a universal solution for whatever corpora anyone can massage into the XML dump format. Does that sound sane?
|
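A minimal sketch of the script suggested above, assuming the Solr-import XML carries "id" and "titleText" fields and that a simple label<TAB>id TSV is acceptable to label-lookup - both assumptions should be checked against the actual schema and service:

```python
# One pass over a Solr-import XML file (<add><doc><field name="...">...),
# writing one "label<TAB>concept-id" line per document.
# Field names and the TSV output format are assumptions (see above).
import sys
import xml.etree.ElementTree as ET

def solr_xml_to_labels(xml_path, out_path):
    with open(out_path, "w", encoding="utf-8") as out:
        for _, elem in ET.iterparse(xml_path):
            if elem.tag != "doc":
                continue
            fields = {f.get("name"): (f.text or "") for f in elem.findall("field")}
            doc_id, title = fields.get("id"), fields.get("titleText")
            if doc_id and title:
                out.write(f"{title.strip()}\t{doc_id.strip()}\n")
            elem.clear()  # keep memory flat on large corpora

if __name__ == "__main__":
    solr_xml_to_labels(sys.argv[1], sys.argv[2])
```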
Yes, this is exactly the minimal solution I've been thinking about. So I wrote the wiki parser - it only needs one pass over a MediaWiki XML dump, filters out all the offtopic pages, and simultaneously adds the remaining TODs both to a label file and to a Solr XML input file in plaintext, with any MediaWiki or HTML markup removed (well, in theory, but it is reasonably clean; see the sketch after this comment). So far, so good. Solr and the label service work; I took my offline version of Yoda 1.4 I still had around (the standard 1.4 release with online backends replaced by localhost), but I currently get:
which is strange, because the label service reports actual requests, for instance:
I don't run any backends besides Solr and one label service - is it trying to reach the SQLite backend on port 5001 (which is not there), or what is happening? Update:
Oh, one more thing: of course, I made sure that everything else is correct: when I replace the Solr file and the label file with the Wikipedia ones, YodaQA runs perfectly and fully locally. |
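A compressed sketch of the kind of one-pass parser described above. The namespace URI, the markup stripping, and the offtopic test are simplified assumptions - robust markup removal needs a real parser such as mwparserfromhell:

```python
# Stream a MediaWiki XML dump, skip offtopic pages, and write each kept
# page both to a labels TSV and to a Solr-import XML file (see above for
# which parts are simplified assumptions).
import re
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

NS = "{http://www.mediawiki.org/xml/export-0.10/}"  # check your dump's version

def strip_markup(wikitext):
    text = re.sub(r"\{\{[^{}]*\}\}", "", wikitext)                 # templates (one level)
    text = re.sub(r"\[\[(?:[^|\]]*\|)?([^\]]*)\]\]", r"\1", text)  # links -> visible label
    return re.sub(r"<[^>]+>|'{2,}", "", text)                      # HTML tags, bold/italics

def convert(dump_path, solr_path, labels_path, is_offtopic):
    doc_id = 0
    with open(solr_path, "w", encoding="utf-8") as solr, \
         open(labels_path, "w", encoding="utf-8") as labels:
        solr.write("<add>\n")
        for _, page in ET.iterparse(dump_path):
            if page.tag != NS + "page":
                continue
            title = page.findtext(NS + "title") or ""
            text = page.findtext(f"{NS}revision/{NS}text") or ""
            if title and text and not is_offtopic(title, text):
                doc_id += 1
                labels.write(f"{title}\t{doc_id}\n")
                solr.write(f'<doc><field name="id">{doc_id}</field>'
                           f'<field name="titleText">{escape(title)}</field>'
                           f'<field name="text">{escape(strip_markup(text))}</field></doc>\n')
            page.clear()
        solr.write("</add>\n")

# Hypothetical offtopic filter: drop pages that never mention the domain.
convert("dump.xml", "corpus.xml", "labels.tsv",
        lambda title, text: "learning" not in text.lower())
```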
That's right, it's trying to reach the SQLite backend. Either disable it in the YodaQA code, or run one with an empty database.
|
Just for the record: While running with an empty DB doesn't work for me, deactivating the second label service in the YodaQA code (just return an empty list instead of performing the actual lookup) does work. Hence, I might have the first (admittedly rather naive) domain adaptation: The system can now use the Edutech-Wiki content. Definitions work fairly well (What is e-learning, mobile learning, blended learning etc.) and, when one is willing to consider the top 5, even questions like "Which learning theory did John Dewey [or Piaget etc.] contribute to?" are often answered correctly. While it is certainly not ready for productive use in research and the recommended articles still link to Wikipedia instead of the other wiki, it seems like a decent stepping stone for more involved domain adaptation. I guess the next step is to get my hands dirty and introduce my JBT backend to the mix. Ain't gonna be pretty, but someone needs to do it. PS: |
Awesome work! When you are able to do so, it would be awesome if you could publish your code or a step-by-step guide for importing the Edutech-Wiki and using it for YodaQA.
|
Sure. I'm busy for the next 9 days, but afterwards I'll just finish my paper and send you the whole thing including detailed reports on everything we talked about. |
@k0105 Any update yet? Thanks. |
We've played with the topic classifiers. Again, simplicity triumphed. While random forests with tf/idf, word lemmatization and Snowball stemming work well, I always wanted to try SVMs on this, and indeed: while the former solution yields around 92%, the latter gives 94-95%. You can find the code here: https://github.com/yunshengb/QueryClassifier/blob/master/sklearn/train.py I'm currently reimplementing the ensemble so it's independent of JavaFX (almost done) and then I'll move over to the wiki parser and keyphrase extraction. I'll most likely just upload it to GitHub. Slowly but surely I'm getting through the material. |
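For reference, the tf-idf + linear SVM setup described here boils down to a few lines of scikit-learn; the questions.tsv file and its label<TAB>question format are hypothetical, and the stemming/lemmatization steps are omitted for brevity:

```python
# Cross-validated tf-idf + linear SVM topic classifier (scikit-learn).
# Data loading is hypothetical; see the lead-in for assumptions.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

def load_dataset(path):
    texts, labels = [], []
    with open(path, encoding="utf-8") as f:
        for line in f:
            label, text = line.rstrip("\n").split("\t", 1)
            labels.append(label)
            texts.append(text)
    return texts, labels

texts, labels = load_dataset("questions.tsv")  # hypothetical file
clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LinearSVC())
scores = cross_val_score(clf, texts, labels, cv=5)
print(f"accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```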
@nazeer1 So, have you been able to hook up your own triple store? That sounds pretty interesting... You can switch DBpedia and Freebase to another Fuseki backend by setting system properties - try |
@k0105 thanks for your comment. I don't know where I should run your suggested command, though. I uploaded my RDFs into Jena Fuseki; the backend URL is "http://localhost:3030/dbpedia/query". I replaced the DBpedia and Freebase URLs with it in all classes in the "provider.rdf" package. I also updated the "fuzzyLookupUrl" and "crosswikiLookupUrl" with my URL, and then I had to comment out everything in the "DBpediaTypesTest.groovy" class. Now when I run the web view and search for a query, it keeps searching and I get this error message in the terminal: Can you please give me a hint, or tell me if I am doing something wrong? |
If you simply substitute the DBpedia URL with your RDF domain, that will affect not just the knowledge base used to generate answers, but also further answer scoring components, which you might not want to do - and if you do want to, you'll have to rewrite classes like DBpediaTitles. So, the easiest route is to keep the DBpedia answer scoring components at first and change just the answer generator component, which is DBpediaOntology (or DBpediaProperties; they are almost identical): clone the DBpediaLookup class to a MyKBLookup class with an appropriately modified URL, and clone DBpediaProperties to MyKBProperties, with a SPARQL query that is appropriate to your knowledge base. Finally, clone the DBpediaProperty* classes in the pipeline.structured package to MyKBProperty* classes, using your RDF provider instead, and put them in the main YodaQA class pipeline code in place of the previous knowledge bases. If you want to perform more complex question answering than just a direct entity attribute lookup, you should look at FreebaseOntology instead, but it's more complex code too. Finally, if you want to answer questions only from knowledge bases rather than from a combination of KB and unstructured texts, base your work on the d/movies branch rather than master. HTH |
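Before cloning those Java classes, it can help to sanity-check from outside YodaQA that the Fuseki endpoint answers the kind of query a MyKBProperties class would issue. A sketch, where the endpoint URL, the entity label, and the use of rdfs:label are assumptions about your knowledge base:

```python
# Query a local Fuseki endpoint for the properties of an entity, the way
# a DBpediaProperties-style lookup would. All names here are placeholders
# for your own knowledge base (see the lead-in).
import requests

ENDPOINT = "http://localhost:3030/dbpedia/query"  # your Fuseki dataset

def properties_of(label):
    query = """
        SELECT ?prop ?value WHERE {
          ?entity <http://www.w3.org/2000/01/rdf-schema#label> "%s"@en .
          ?entity ?prop ?value .
        } LIMIT 20
    """ % label
    r = requests.get(ENDPOINT, params={"query": query},
                     headers={"Accept": "application/sparql-results+json"})
    r.raise_for_status()
    return [(b["prop"]["value"], b["value"]["value"])
            for b in r.json()["results"]["bindings"]]

for prop, value in properties_of("Example Entity"):  # hypothetical label
    print(prop, value)
```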
Thanks @pasky, @k0105 for your hints. I still have some problems adapting my domain to YodaQA. For changing the domain I have done the following steps and face the following problems; I would really appreciate it if you could give me a hint on how to solve them: *** Then, when I added some of my data to the DBpedia RDF files for testing (keeping the DBpedia structure so I wouldn't need to change the SPARQL code), it also remained in process for a long time, showing none of the data that I added, and shows this message in the terminal. **** Lastly, I would like to ask whether I should change these two URLs in the DBpediaTitles.java class. Thanks in advance. |
There are no such entries in DBpediaTitles anymore, cmp. https://github.com/brmson/yodaqa/blob/master/src/main/java/cz/brmlab/yodaqa/provider/rdf/DBpediaTitles.java and yeah, I'd change those. As aforementioned, you should disable the second labeling service by returning an empty list and parse your data to create an input file for the first label service. For the other questions, I have to refer to Petr, since he is much more knowledgeable about Yoda than I am. |
Many, many thanks @k0105 for the information. Well, I am using d/movies; there I still had the old version of the code in "DBpediaTitles", but I did replace it with the new one from the link that you mentioned. Now it is more clear to me, and I will try to disable the second label service and parse my data into the first one as you said - hope it works. I will update you with my results ASAP. |
Please do - if you really manage to connect your own triple store and get a working domain adaptation, that would be quite interesting to read about. |
I have read all the README.md files and all the papers, but I cannot find any instructions or tutorials for building a simple application (even though I know some general steps: question analysis, answer producers, ...). If I have created some data (unstructured & structured) and some natural language processing models (NER, POS, ...), how do I use them with YodaQA? Can you explain this to me, or write a tutorial covering it all?