Semantic Rider

Milestones

  • v0.0.1: First version of tab search plugin
  • v0.0.2: Initial version with cleanup popup UI
  • v0.0.3: Refactored code for ease of testing and extensibility (In Progress)
  • v0.0.4: TBD

Running the Program

Basic

  1. Ensure that your virtual env is set up and activated (assuming you are at the base of the repo)
    • virtualenv venv
    • source venv/bin/activate
    • pip install -r serv/requirements.txt
    • export PYTHONPATH=$PYTHONPATH:$(pwd)
  2. Ensure that you have fetched all files
    • Create the serv/res folder if it does not exist, and place meta_train_*.pkl and embed_train_*.pkl inside it
      • git lfs pull
  3. Download the encoder model from https://drive.google.com/file/d/1JLTYMaCtY4pkl4oeygXVnk_GxJpOWxKH/view?usp=drive_link and put it in the server folder.
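Before moving on, it can help to confirm that the resource files from step 2 are actually in place. The snippet below is a minimal sketch and not part of the repo: the helper name missing_resources is hypothetical, and it only checks the two glob patterns mentioned above.

```python
from pathlib import Path


def missing_resources(res_dir, patterns=("meta_train_*.pkl", "embed_train_*.pkl")):
    """Return the glob patterns that match no file inside res_dir.

    Hypothetical helper for illustration; the patterns come from step 2 above.
    """
    res = Path(res_dir)
    return [p for p in patterns if not any(res.glob(p))]


if __name__ == "__main__":
    # Assumes the script is run from the base of the repo.
    missing = missing_resources("serv/res")
    if missing:
        print("Missing in serv/res:", ", ".join(missing), "(try: git lfs pull)")
    else:
        print("All expected resource files are present.")
```

An empty list means both file patterns matched at least one file. A missing serv/res folder is reported the same way as missing files.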

Testing It

  1. Ensure that all required files (model_file, the ort_format file, etc.) are available.

    • You can fetch all of them with git lfs pull
  2. Now run the program

    • Run python serv/test_algo.py from the base of the repo. This should give you 90% accuracy
    • This takes about 10-12 hrs if you don't have embed_data and meta_file or you are reindexing; otherwise it takes about 1 min
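The gap between the two run times is typical of a load-or-rebuild cache: the slow path recomputes the data and writes it to disk, and the fast path just reads the pickle back. The sketch below illustrates that general pattern only. It is an assumption about how such caching could work, not the code in serv/test_algo.py, and load_or_build, build_fn and reindex are hypothetical names.

```python
import pickle
from pathlib import Path


def load_or_build(cache_path, build_fn, reindex=False):
    """Load a pickled object from cache_path, or build it and cache it.

    build_fn is a zero-argument callable standing in for the slow
    indexing step (hypothetical; not taken from the repo).
    """
    path = Path(cache_path)
    if path.exists() and not reindex:
        # Fast path: reuse the previously pickled data.
        with path.open("rb") as fh:
            return pickle.load(fh)
    # Slow path: rebuild, then write the result for the next run.
    data = build_fn()
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("wb") as fh:
        pickle.dump(data, fh)
    return data
```

Passing reindex=True forces the slow path even when a cached file exists, which corresponds to the reindexing case mentioned above.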

Using It

  1. Install the plugin in Chrome in developer mode. To do this, see below

  2. Running the server

  • Activate the virtual environment as described in step 1 of Basic
  • Run the Flask server with python serv/server.py
  • If the server starts successfully, the text of any site you visit should be displayed in the server's output
  • Now if you search in the semrider plugin, it should return results

Data (note: replace every - with _ in the filenames)

  1. data/eval-100-samples.csv : 100 URLs from 10 categories, such as llm-blog, tech blog, etc., with phrases to match. Used to check accuracy across categories
  2. data/confsbl-hn-url-gt-100.csv : Confusable data from the 15k YC news URLs, used to confuse the 100 eval URLs above
  3. res/meta-train-v02.pkl : metadata for the top 1k of confsbl + the 100 evals
  4. res/embed-train-v02.pkl : embedding data for the top 1k of confsbl + the 100 evals
  5. You can use embed-train/meta-train as prod as well: just make copies named embed-prod-v02.pkl and meta-prod-v02.pkl
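The copy described in the last item can be scripted. This is a minimal sketch, assuming the files live in one res folder and use the underscore names from the note in the heading (for example embed_train_v02.pkl). The function name copy_train_to_prod is hypothetical.

```python
import shutil
from pathlib import Path


def copy_train_to_prod(res_dir, version="v02"):
    """Copy embed_train_<version>.pkl and meta_train_<version>.pkl to prod names.

    Hypothetical helper; returns the list of files it created.
    """
    res = Path(res_dir)
    copied = []
    for kind in ("embed", "meta"):
        src = res / f"{kind}_train_{version}.pkl"
        dst = res / f"{kind}_prod_{version}.pkl"
        # Raises FileNotFoundError if the train file is missing.
        shutil.copyfile(src, dst)
        copied.append(dst)
    return copied
```

For example, copy_train_to_prod("serv/res") would create embed_prod_v02.pkl and meta_prod_v02.pkl next to the train files, assuming serv/res is where they are stored.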