edisonlz/suffixTree_ch


SuffixTree --- A Suffix Tree library for Python with Chinese support

Conspirator: Danny Yoo ([email protected])

This is a SWIG wrapper around Dan Gusfield's 'strmat' suffix tree library.

http://www.cs.ucdavis.edu/~gusfield/strmat.html

Suffix trees allow for very powerful string matching, and are used quite a bit in many elegant string algorithms. Since this is a wrapper around strmat.stree, most of the documentation in doc/stree.doc should apply.
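
To make the idea concrete, here is a minimal pure-Python sketch of a naive suffix trie. It is an illustration only: the class name NaiveSuffixTrie and its methods are hypothetical, it uses quadratic space, and it is not how strmat builds its (linear-size) suffix tree. It only shows why indexing every suffix makes substring queries a simple walk from the root.

```python
# Naive suffix trie: illustration only, NOT strmat's algorithm.
class NaiveSuffixTrie(object):
    def __init__(self):
        # Each node is a dict mapping one character to a child node.
        self.root = {}

    def add(self, text):
        # Insert every suffix of text, one character per edge.
        for start in range(len(text)):
            node = self.root
            for ch in text[start:]:
                node = node.setdefault(ch, {})

    def contains(self, pattern):
        # A pattern is a substring iff it is a prefix of some suffix,
        # so it must spell out a path starting at the root.
        node = self.root
        for ch in pattern:
            if ch not in node:
                return False
            node = node[ch]
        return True


trie = NaiveSuffixTrie()
trie.add("banana")
found = trie.contains("nan")     # "nan" occurs inside "banana"
missing = trie.contains("nab")   # "nab" does not
```

A query costs time proportional to the pattern length, independent of how much text has been indexed; a real suffix tree keeps that property while compressing the trie to linear size.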

Here are the modules included in the SuffixTree package:

o SuffixTree.SuffixTree -- The suffix tree structure.  This is a
  thin wrapper around strmat's stree data structure.  This isn't a
  complete wrapper yet; I need to find some time to complete this.
  The wrapper appears to be good enough for simple stuff.

  Methods of SuffixTree:

      o SuffixTree(alphabet=STREE_ASCII)

          Construct a new SuffixTree.  By default, the alphabet
          used by the SuffixTree is ASCII.  Other choices include
          STREE_DNA, STREE_RNA, and STREE_PROTEIN.

      o add(string, id)

          Adds a string to the suffix tree with an id.

      o root()

          Returns the root SuffixNode of the tree.

      o num_nodes()

          Returns the total number of nodes held in the tree.

      o match(string)

          Given a string, traverse the suffix tree and return a
          3-tuple (match_length, suffix_node, endpos).

o SuffixTree.SuffixNode  (I need to fix the documentation here)

    Methods of SuffixNode:
    num_children()
    find_child(char ch)
    children()
    next()
    parent()
    suffix_link()
    edgelen()
    edgestr()
    getch()
    labellen()
    labelstr()
    ident()
    num_leaves()
    leaf(int leafnum)



o SuffixTree.SubstringDict -- An application of suffix trees toward
  substring matching.  An example might help:

  >>> #coding=utf-8
  >>> from SuffixTree import SubstringDict
  

  >>> sd = SubstringDict()
  >>> sd.__setitem__("我是python程序员",1)
  >>> sd.__setitem__("我是ruby程序员",2)
  >>> sd.__setitem__("我是javascript程序员",3)
  >>> sd.__setitem__("我是android程序员",4)
  >>> sd.__setitem__("我还是DBA",5)
  >>> print sd["我是"]
  >>> print sd["我还是"]

  Equivalently, using subscript assignment:

  >>> sd = SubstringDict()
  >>> sd["我是python程序员"] = 1
  >>> sd["我是ruby程序员"] = 2
  >>> sd["我是javascript程序员"] = 3
  >>> sd["我是android程序员"] = 4
  >>> sd["我还是DBA"] = 5
  >>> print sd["我还是"]


  SubstringDict provides a mapping that can be looked up by any
  substring of its keys.  The keys must be strings.
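
  The lookup semantics can be sketched with a naive pure-Python stand-in. This is an assumption-laden illustration, not the library's implementation: NaiveSubstringDict is a hypothetical name, it scans every key linearly instead of using a suffix tree, and it assumes that a lookup returns the list of values of all keys containing the substring (the order returned by the real SubstringDict may differ).

```python
# -*- coding: utf-8 -*-
# Naive stand-in for the lookup behaviour; no suffix tree involved.
class NaiveSubstringDict(object):
    def __init__(self):
        self._items = []  # list of (key, value) pairs in insertion order

    def __setitem__(self, key, value):
        self._items.append((key, value))

    def __getitem__(self, substring):
        # Assumed semantics: values of every key that contains substring.
        return [value for key, value in self._items if substring in key]


sd = NaiveSubstringDict()
sd[u"我是python程序员"] = 1
sd[u"我是ruby程序员"] = 2
sd[u"我是javascript程序员"] = 3
sd[u"我是android程序员"] = 4
sd[u"我还是DBA"] = 5

programmers = sd[u"我是"]    # matches the four keys starting with 我是
dba = sd[u"我还是"]          # matches only the DBA key
```

  The linear scan costs time proportional to the total size of all keys per lookup; the point of backing the mapping with a suffix tree is to avoid exactly that.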

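  Chinese is supported by base64-encoding the strings.  This increases
  the data size by roughly one third and costs some performance, but
  the overhead is small.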
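
  The size overhead can be checked with a short sketch. How this package applies base64 internally is not described here, so the helper to_ascii_key below is hypothetical and simply encodes the whole UTF-8 string in one go; it is meant only to show where the extra size comes from and one caveat of that whole-string approach.

```python
# -*- coding: utf-8 -*-
import base64


def to_ascii_key(text):
    """Base64-encode the UTF-8 bytes of text, giving a pure-ASCII string."""
    return base64.b64encode(text.encode("utf-8")).decode("ascii")


raw = u"我是python程序员"
encoded = to_ascii_key(raw)

raw_size = len(raw.encode("utf-8"))   # bytes before encoding
encoded_size = len(encoded)           # ASCII characters after encoding
overhead = float(encoded_size - raw_size) / raw_size

# Caveat of whole-string encoding: base64 works on 3-byte groups, so an
# encoded substring only reappears inside the encoded key when it starts
# on a 3-byte boundary and its length is a multiple of 3 bytes.
aligned_hit = to_ascii_key(u"我是") in encoded      # 6 bytes at offset 0
unaligned_hit = to_ascii_key(u"ython") in encoded   # 5 bytes at offset 7
```

  Each Chinese character here occupies exactly three UTF-8 bytes, which is why purely Chinese substrings stay aligned; mixed ASCII and Chinese text needs more care, for example encoding character by character.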

  Installing on 64-bit systems:
  ARCHFLAGS="-arch i386 -arch x86_64" python setup.py install
