Domain Adaptations #17
Hi, there is unfortunately no established process for this currently, just a theoretical possibility that it could be done relatively painlessly. Some pioneers need to explore this further personally. :-) If you want to add a new corpus of knowledge to YodaQA, here are some steps I've recently explained in an email to someone: First, I recommend skimming over doc/UIMA and doc/HIGHLEVEL. Building of the main pipeline is in cz.brmlab.yodaqa.pipeline.YodaQA. It would also be good to know more about what kind of corpora you have. The easiest thing to do is to just import them in Solr. However, with enwiki, we rely heavily on the fact that it is a corpus of title-oriented documents.
If you have some linguistic models trained for your purposes, I think you should be able to specify them as resources to the DKPro annotators (or just replace the annotators with custom ones). I realize this is not a detailed technical guide - it's currently not something for which we'd have a walkthrough, but with some cooperation with others, we could turn it into such a thing. |
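Regarding the Solr import mentioned above, here is a minimal sketch of pushing a Solr-import XML file into a local Solr over HTTP. The collection name ("data"), port, and file name are assumptions, and YodaQA's own import tooling may differ:

```python
# Minimal sketch: POST a Solr-import XML file (<add><doc>...</doc></add>)
# to a local Solr collection and commit in the same request.
# Collection name "data" and the default port 8983 are assumptions.
import requests

SOLR_UPDATE = "http://localhost:8983/solr/data/update"

with open("corpus.xml", "rb") as f:
    r = requests.post(SOLR_UPDATE, data=f,
                      params={"commit": "true"},
                      headers={"Content-Type": "text/xml"})
r.raise_for_status()
print("import status:", r.status_code)
```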
I have a document-oriented corpus and a relational database (which I created), both about a specific domain. Now I want to replace Freebase and the wiki (all databases, if possible). So do we only change the databases, or do we also need to change the Analysis Engines or the CAS structure? |
In that case, it should be enough to just change the databases for starters. Let us know how it went! |
But will question analysis extract correctly for a specific domain? |
That's a good question. In principle, it will be doing something - but of course, it may not be perfect in recognizing named entities. Later, you can improve this using your own NER model. Also, entity linking will work better if you load up your list of entities to a lookup service. But I think it's best to do this gradually and at the beginning just swap the knowledge bases for answer production. Another thing that will help later is typing candidate answers. An example for recognizing biomedical terms (like protein names) using GeneOntology: 7a1389c |
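A minimal sketch of the kind of entity lookup meant here - loading a domain entity list and resolving mentions to canonical IDs. The file name and the "entity_id<TAB>label" format are hypothetical; YodaQA's actual label-lookup service additionally does fuzzy (typo-tolerant) matching:

```python
# In-memory entity lookup sketch: normalize surface forms so that near
# matches ("Gene Ontology", "gene-ontology") resolve to one canonical ID.
# The "entity_id<TAB>label" file format is a hypothetical example, not the
# format YodaQA's label-lookup service actually consumes.
import re
from collections import defaultdict

def normalize(label):
    return re.sub(r"[^a-z0-9]+", " ", label.lower()).strip()

def load_entities(path):
    index = defaultdict(set)
    with open(path, encoding="utf-8") as f:
        for line in f:
            entity_id, label = line.rstrip("\n").split("\t", 1)
            index[normalize(label)].add(entity_id)
    return index

def lookup(index, mention):
    return sorted(index.get(normalize(mention), ()))

idx = load_entities("domain-entities.tsv")  # hypothetical file
print(lookup(idx, "Gene Ontology"))
```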
Let me pick up this thread. I was the one Petr sent the explanation to. So far I have primarily been working on Watson (which turned out to be a good choice considering they will shut down the current example corpora for medicine and travel, as well as the corresponding QA API, in a month, so I got to play around with them) and my frontend, but I have also built a local minicluster with an octocore and two quadcore boxes with 24, 20 and 8GB of RAM respectively. They run YodaQA completely offline and I replaced the enwiki Solr DB with an example corpus [plaintext papers, not TODs (Title Oriented Documents)], which doesn't work at all 😁 So over the next weeks and especially early next year I will try to make it useful. If someone has additional ideas about what to do next, I'm all ears. Oh yeah, one more thing: creating the Freebase node has now been running for 29 days on an i7 quadcore with 24GB RAM. It still seems to progress steadily, though (1.25 billion DB entries so far). Should I worry? Everything else took more or less exactly the time specified in the READMEs, but this thing is just through the roof. Best wishes, |
Hi! Are you importing Freebase to Fuseki? Is it stuck on CPU or IO? Is the rate constant or slowing down? It's surprising that it's taking so long. Are you in the initial import phase or indexing phase? We may want to start another issue to discuss that, though. |
Update: Freebase setup is solved, cmp. #26 |
While I haven't worked on this for a while, I now have almost exactly 5000 papers collected (and converted) for "my" domain. With the REST backend functional [did you already have time to look at it?], my new knowledge of how JBT works, and 4 weeks of unassigned time until my deadline, I kinda ask myself how far I could get with the domain adaptation. I'm playing with a few ideas from previous discussions: a) Term extraction to replace the label matching. [I currently only have a glossary I wrote myself with 300 terms.] Could even do this with a UIMA pipeline. Do you think this would make sense (in this order)? Is this the correct bottom line and prioritization from our other discussions? PS: |
I'm not quite sure what you mean here, sorry.
So, for TOD, we have the strategy that takes the first sentence in .solrfull
Totally, I'm very curious about how this would work in practice. I'm not sure what the best way to organize things is. Right now, for
Does that make sense? OTOH we want (c) in master as well. (In the long run, cross-merging all the branches is kind of a pain, |
I'll need some time to think about everything else you said. But about (a): I thought that having some keyphrases should be helpful for QA, since for a new domain we don't know which words or multiwords describe relevant concepts. I then tried to write down a list of key concepts myself, got about 300 concepts and saw that this isn't enough. Hence, in order to acquire this vocabulary automatically, I wrote a keyphrase extraction: I convert my documents into decent plaintext strings, make sure they are English by applying some language detection, then extract keyphrases with one of several backends (currently AlchemyAPI or a local implementation), and then do some cleanup and majority voting to find out which keyphrases are relevant for the entire corpus. This way I get multiple megabytes' worth of keyphrases out of my 5000 documents. For instance, for e-learning papers I would get concepts like "distance learning", "double loop learning", "LaaN theory", "educational data mining" etc. That should be helpful for domain adaptation, right? My first idea was to put these concepts into a label service like the Fuzzy Label Lookup Service you use for DBpedia. But this might be naive, because I haven't really looked into what it does in detail yet. |
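The cleanup-and-voting step described here could look roughly like the following sketch; the cleanup rules and the document-frequency threshold are illustrative assumptions, not the actual implementation:

```python
# Majority voting over per-document keyphrase lists: keep only phrases
# that enough documents agree on. Thresholds and cleanup are assumptions.
from collections import Counter

def clean(phrase):
    p = phrase.strip().lower()
    if len(p) < 3 or p.isdigit():  # drop obvious noise
        return None
    return p

def vote(keyphrases_per_doc, min_doc_fraction=0.01):
    # keyphrases_per_doc: one list of extracted phrases per document.
    doc_freq = Counter()
    for doc_phrases in keyphrases_per_doc:
        cleaned = {clean(p) for p in doc_phrases}
        cleaned.discard(None)
        doc_freq.update(cleaned)  # count each phrase once per document
    cutoff = max(2, min_doc_fraction * len(keyphrases_per_doc))
    return sorted(p for p, n in doc_freq.items() if n >= cutoff)

# With 5000 documents and min_doc_fraction=0.01, a phrase such as
# "distance learning" survives if at least 50 papers yielded it.
```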
Oh, having the code to do this would be awesome - this sounds pretty neat! The step we perform in YodaQA right now which I might have omitted in the discussion above, and which is related to this, is Concept Linking; the way it's done and used is a bit tied to TOD as well:
So, when we don't have a TOD corpus, we should disable label-lookup by default! But if we get your mechanism, we wouldn't have the link, but we would still benefit from it due to better clues. How important it is is hard to say; it's probably not revolutionary, but it should certainly have some impact. I think the best way would be to reuse the label-lookup infrastructure (the fuzzy part, though if you have a way to detect synonymous references to the same concept, that'd belong to the crosswikis part), but provide a null link. Makes sense? I think it's not necessarily the top priority, though. But if you need a reason to get the code for what you described out there, I'd totally be for it. :-) |
Sure. Once I'm done with my paper in early April we'll have to sift through all my stuff and see what you want and how to integrate it. Also: Thanks for the extensive answers, they are very helpful and I highly appreciate it. |
I just got a dump from a wiki about my domain and I'm considering an attempt to put this into YodaQA. Transforming the MediaWiki dump XML format into Solr's XML format is fairly trivial (esp. since I've done that before with my papers) and, similarly, creating the static labels should be straightforward by generating them from the page titles. But: I just read the paper about the SQLite labels ("A Cross-Lingual Dictionary for English Wikipedia Concepts") and this seems way more involved. The authors even explicitly state in their summary that "[t]he dictionary [...] would be difficult to reconstruct in a university setting". Do you think it's worth a shot despite all this? This is neither essential for my paper nor do I really have any time left for it, but I would really like to have the capability of adding arbitrary wiki dumps to YodaQA. I mean, we finally have the distributional semantics, the keyphrase extraction, the converted corpora and the TODs - I would really like to see it all come together now. Any hints or remarks on how to tackle this? What should I do with the extracted keyphrases (which I could also expand with my distributional semantics model to e.g. get similar terms) - can I just throw them in the label service, or do anything clever with them without modifying the UIMA annotators [which I don't have time for until my deadline]? Thanks in advance. Best wishes, |
Hi! I think you are maybe setting your goals too high for your initial work on this. :) The fuzzy labels dataset is just there to make it more robust to different spellings, nicknames etc., and by default that's extracted from the Wikipedia corpus - but if you don't need that, it's fine to just include a mapping from concept name to its canonical article by id, without anything more involved. I think maybe the most elegant solution would be writing a script that takes the Solr-import XML and generates the labels dataset to load into label-lookup based on that. This should be pretty trivial, and a universal solution for whatever corpora anyone can massage into the XML dump format. Does that sound sane?
|
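A minimal sketch of the script suggested above, assuming the Solr-import XML carries "id" and "titleText" fields and that a simple label<TAB>id TSV is acceptable to label-lookup - both assumptions should be checked against the actual schema and service:

```python
# One pass over a Solr-import XML file (<add><doc><field name="...">...),
# writing one "label<TAB>concept-id" line per document.
# Field names and the TSV output format are assumptions (see above).
import sys
import xml.etree.ElementTree as ET

def solr_xml_to_labels(xml_path, out_path):
    with open(out_path, "w", encoding="utf-8") as out:
        for _, elem in ET.iterparse(xml_path):
            if elem.tag != "doc":
                continue
            fields = {f.get("name"): (f.text or "") for f in elem.findall("field")}
            doc_id, title = fields.get("id"), fields.get("titleText")
            if doc_id and title:
                out.write(f"{title.strip()}\t{doc_id.strip()}\n")
            elem.clear()  # keep memory flat on large corpora

if __name__ == "__main__":
    solr_xml_to_labels(sys.argv[1], sys.argv[2])
```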
Yes, this is exactly the minimal solution I've been thinking about. So I wrote the wiki parser - it only needs one pass over a MediaWiki XML dump, filters out all the offtopic pages, and simultaneously adds the remaining TODs both to a label file and to a Solr XML input file in plaintext, with any MediaWiki or HTML markup removed (well, in theory, but it is reasonably clean; see the sketch after this comment). So far, so good. Solr and the label service work; I took my offline version of Yoda 1.4 I still had around (the standard 1.4 release with online backends replaced by localhost), but I currently get:
which is strange, because the label service reports actual requests, for instance:
I don't run any backends besides Solr and one label service - is it trying to reach the SQLite backend on port 5001 (which is not there), or what is happening? Update:
Oh, one more thing: of course, I made sure that everything else is correct: when I replace the Solr file and the label file with the Wikipedia ones, YodaQA runs perfectly and fully locally. |
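A compressed sketch of the kind of one-pass parser described above. The namespace URI, the markup stripping, and the offtopic test are simplified assumptions - robust markup removal needs a real parser such as mwparserfromhell:

```python
# Stream a MediaWiki XML dump, skip offtopic pages, and write each kept
# page both to a labels TSV and to a Solr-import XML file (see above for
# which parts are simplified assumptions).
import re
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

NS = "{http://www.mediawiki.org/xml/export-0.10/}"  # check your dump's version

def strip_markup(wikitext):
    text = re.sub(r"\{\{[^{}]*\}\}", "", wikitext)                 # templates (one level)
    text = re.sub(r"\[\[(?:[^|\]]*\|)?([^\]]*)\]\]", r"\1", text)  # links -> visible label
    return re.sub(r"<[^>]+>|'{2,}", "", text)                      # HTML tags, bold/italics

def convert(dump_path, solr_path, labels_path, is_offtopic):
    doc_id = 0
    with open(solr_path, "w", encoding="utf-8") as solr, \
         open(labels_path, "w", encoding="utf-8") as labels:
        solr.write("<add>\n")
        for _, page in ET.iterparse(dump_path):
            if page.tag != NS + "page":
                continue
            title = page.findtext(NS + "title") or ""
            text = page.findtext(f"{NS}revision/{NS}text") or ""
            if title and text and not is_offtopic(title, text):
                doc_id += 1
                labels.write(f"{title}\t{doc_id}\n")
                solr.write(f'<doc><field name="id">{doc_id}</field>'
                           f'<field name="titleText">{escape(title)}</field>'
                           f'<field name="text">{escape(strip_markup(text))}</field></doc>\n')
            page.clear()
        solr.write("</add>\n")

# Hypothetical offtopic filter: drop pages that never mention the domain.
convert("dump.xml", "corpus.xml", "labels.tsv",
        lambda title, text: "learning" not in text.lower())
```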
That's right, it's trying to reach the SQLite backend. Either disable it in the YodaQA code, or run one with an empty database.
|
Just for the record: While running with an empty DB doesn't work for me, deactivating the second label service in the YodaQA code (just return an empty list instead of performing the actual lookup) does work. Hence, I might have the first (admittedly rather naive) domain adaptation: The system can now use the Edutech-Wiki content. Definitions work fairly well (What is e-learning, mobile learning, blended learning etc.) and, when one is willing to consider the top 5, even questions like "Which learning theory did John Dewey [or Piaget etc.] contribute to?" are often answered correctly. While it is certainly not ready for productive use in research and the recommended articles still link to Wikipedia instead of the other wiki, it seems like a decent stepping stone for more involved domain adaptation. I guess the next step is to get my hands dirty and introduce my JBT backend to the mix. Ain't gonna be pretty, but someone needs to do it. PS: |
Awesome work! When you are able to do so, it would be awesome if you could publish your code or a step-by-step guide for importing the Edutech-Wiki and using it for YodaQA.
|
Sure. I'm busy for the next 9 days, but afterwards I'll just finish my paper and send you the whole thing including detailed reports on everything we talked about. |
@k0105 Any update yet? Thanks. |
We've played with the topic classifiers. Again, simplicity triumphed. While random forests with tf/idf, word lemmatization and Snowball stemming work well, I always wanted to try SVMs on this, and indeed: while the former solution yields around 92%, the latter gives 94-95%. You can find the code here: https://github.com/yunshengb/QueryClassifier/blob/master/sklearn/train.py I'm currently reimplementing the ensemble so it's independent of JavaFX (almost done) and then I'll move over to the wiki parser and keyphrase extraction. I'll most likely just upload it to GitHub. Slowly but surely I'm getting through the material. |
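For reference, the tf-idf + linear SVM setup described here boils down to a few lines of scikit-learn; the questions.tsv file and its label<TAB>question format are hypothetical, and the stemming/lemmatization steps are omitted for brevity:

```python
# Cross-validated tf-idf + linear SVM topic classifier (scikit-learn).
# Data loading is hypothetical; see the lead-in for assumptions.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

def load_dataset(path):
    texts, labels = [], []
    with open(path, encoding="utf-8") as f:
        for line in f:
            label, text = line.rstrip("\n").split("\t", 1)
            labels.append(label)
            texts.append(text)
    return texts, labels

texts, labels = load_dataset("questions.tsv")  # hypothetical file
clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LinearSVC())
scores = cross_val_score(clf, texts, labels, cv=5)
print(f"accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```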
@nazeer1 So, have you been able to hook up your own triple store? That sounds pretty interesting... You can switch DBpedia and Freebase to another Fuseki backend by setting system properties - try |
@k0105 thanks for your comment. I don't know where I should run your suggested command, though. I uploaded my RDFs into Jena Fuseki; the backend URL is "http://localhost:3030/dbpedia/query". I replaced the DBpedia and Freebase URLs with it in all classes in the "provider.rdf" package. I also updated the "fuzzyLookupUrl" and "crosswikiLookupUrl" with my URL, and then I had to comment out everything in the "DBpediaTypesTest.groovy" class. Now when I run the web view and search for a query, it keeps searching and I get this error message in the terminal: Can you please give me a hint, or tell me if I am doing something wrong? |
If you simply substitute the DBpedia URL with your RDF domain, that will affect not just the knowledge base used to generate answers, but also further answer scoring components, which you might not want to do - and if you do want to, you'll have to rewrite classes like DBpediaTitles. So, the easiest route is to keep the DBpedia answer scoring components at first and change just the answer generator component, which is DBpediaOntology (or DBpediaProperties; they are almost identical): clone the DBpediaLookup class to a MyKBLookup class with an appropriately modified URL, and clone DBpediaProperties to MyKBProperties, with a SPARQL query that is appropriate to your knowledge base. Finally, clone the DBpediaProperty* classes in the pipeline.structured package to MyKBProperty* classes, using your RDF provider instead, and put them in the main YodaQA class pipeline code in place of the previous knowledge bases. If you want to perform more complex question answering than just a direct entity attribute lookup, you should look at FreebaseOntology instead, but it's more complex code too. Finally, if you want to answer questions only from knowledge bases rather than from a combination of KB and unstructured texts, base your work on the d/movies branch rather than master. HTH |
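Before cloning those Java classes, it can help to sanity-check from outside YodaQA that the Fuseki endpoint answers the kind of query a MyKBProperties class would issue. A sketch, where the endpoint URL, the entity label, and the use of rdfs:label are assumptions about your knowledge base:

```python
# Query a local Fuseki endpoint for the properties of an entity, the way
# a DBpediaProperties-style lookup would. All names here are placeholders
# for your own knowledge base (see the lead-in).
import requests

ENDPOINT = "http://localhost:3030/dbpedia/query"  # your Fuseki dataset

def properties_of(label):
    query = """
        SELECT ?prop ?value WHERE {
          ?entity <http://www.w3.org/2000/01/rdf-schema#label> "%s"@en .
          ?entity ?prop ?value .
        } LIMIT 20
    """ % label
    r = requests.get(ENDPOINT, params={"query": query},
                     headers={"Accept": "application/sparql-results+json"})
    r.raise_for_status()
    return [(b["prop"]["value"], b["value"]["value"])
            for b in r.json()["results"]["bindings"]]

for prop, value in properties_of("Example Entity"):  # hypothetical label
    print(prop, value)
```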
Thanks @pasky, @k0105 for your hints. I still have some problems adapting my domain to YodaQA. For changing the domain I have done the following steps and face the following problems; I would really appreciate it if you could give me a hint on how to solve them: *** Then, when I added some of my data to the DBpedia RDF files for testing (keeping the DBpedia structure so I wouldn't need to change the SPARQL code), it also remained in process for a long time, showing none of the data that I added, and shows this message in the terminal. **** Lastly, I would like to ask whether I should change these two URLs in the DBpediaTitles.java class. Thanks in advance. |
There are no such entries in DBpediaTitles anymore, cmp. https://github.com/brmson/yodaqa/blob/master/src/main/java/cz/brmlab/yodaqa/provider/rdf/DBpediaTitles.java and yeah, I'd change those. As aforementioned, you should disable the second labeling service by returning an empty list and parse your data to create an input file for the first label service. For the other questions, I have to refer to Petr, since he is much more knowledgeable about Yoda than I am. |
Many, many thanks @k0105 for the information. Well, I am using d/movies; there I still had the old version of the code in "DBpediaTitles", but I did replace it with the new one from the link that you mentioned. Now it is more clear to me, and I will try to disable the second label service and parse my data into the first one as you said - hope it works. I will update you with my results ASAP. |
Please do - if you really manage to connect your own triple store and get a working domain adaptation, that would be quite interesting to read about. |
I have read all the README.md files and all the papers, but I cannot find any instructions or tutorials for building a simple application (even though I know some general steps: question analysis, answer producers, ...). If I have created some data (unstructured & structured) and some natural language processing models (NER, POS, ...), how do I use them with YodaQA? Can you explain this to me, or write a tutorial covering it all?