GlobalQuran is a well-built ecosystem of UI pages, widgets, APIs, and search that anyone can use on their own site.

This page is just to explore & evaluate new datasets that may or may not end up being used in GlobalQuran.

  •  [Word2Word data]
  •  [Buck data]
  •  [Corpus grammar data]
  •  [Transliteration data]

  • Word2Word data:
    • Objective: For each Quran word, get the English meaning. No compound-word meanings.
    • Reliability: *** (3/5) almost as good as the source
    • Demo: []  -- see the 4th column
    • Challenges: since it's machine-matched, the data may not be 100% accurate. For compound words, most meanings are prefixed with **; these mark places where the meaning is duplicated across side-by-side words.
    • How was the data generated?
      • Started with the original data from []
      • Snipped all the meanings out of the JS. Ex: []
      • Identified all places where there's a mismatch between the number of words in Tanzil and the number of meaning segments.
      • Where mismatched, either fixed by hand (in a few cases) or duplicated the meaning after prepending **.
    • Note: Apparently other projects have also had to go through these word2word data cleanups.
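A minimal sketch of the mismatch-fixing step above. Everything here is an assumption for illustration: the function name, and the idea that `words` is one ayah's words from Tanzil while `meanings` is the segment list snipped from the source JS.

```javascript
// Hypothetical sketch of the word2word alignment step described above.
// `words`: one ayah's words from Tanzil; `meanings`: the meaning segments
// snipped from the source JS. Names and data shapes are assumptions.
function alignMeanings(words, meanings) {
  if (words.length === meanings.length) return meanings; // already 1:1
  // Mismatch: pad by duplicating the last meaning, prefixed with "**"
  // to mark it as a compound-word duplicate (fix by hand where needed).
  var out = meanings.slice();
  while (out.length < words.length) {
    out.push('**' + out[out.length - 1]);
  }
  return out.slice(0, words.length);
}
```

The `**` prefix survives into the published data, which is why most compound-word meanings in the demo carry it.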

  • Buck data:
    • Objective: Instead of passing Tanzil Quran data as Unicode, pass it as ASCII: a one-to-one mapping of Arabic Unicode to ASCII that can be mapped & remapped without loss of fidelity.
    • Reliability: **** (4/5) as good as Tanzil, the source
    • Demo: []  -- see 3rd column
    • Advantages?
      • Buck uses less bandwidth
      • In JavaScript, you can search through the entire Buck Quran text in one shot; much more intuitive than searching Arabic.
      • Buck-to-Arabic and Arabic-to-Buck is a simple JS call. Play with a live sample here: [].
      • You can strip all vowels out of Buck text in a few milliseconds. Why do this? So you can search in JavaScript while ignoring the tashkeel differences (fatha, damma, kasra), which leads to more hits.
      • Regex + Buck text can lead to awesome optimizations. All searches can be run locally. Demo: []
    • How was the data generated? Just a one-to-one mapping using: []
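The mapping and vowel-stripping steps above can be sketched like this. Only a small subset of the Buckwalter table is shown here (the full one-to-one table, from the mapping file linked above, covers every Arabic letter and diacritic):

```javascript
// Partial Buckwalter <-> Arabic mapping (subset for illustration only).
var buck2arab = {
  'A': '\u0627', 'b': '\u0628', 't': '\u062A', 's': '\u0633',
  'l': '\u0644', 'm': '\u0645', 'n': '\u0646', 'r': '\u0631',
  'h': '\u0647',
  'a': '\u064E', 'u': '\u064F', 'i': '\u0650', // fatha, damma, kasra
  '~': '\u0651', 'o': '\u0652'                 // shadda, sukun
};
var arab2buck = {};
for (var k in buck2arab) arab2buck[buck2arab[k]] = k;

// One-to-one, so converting back and forth loses nothing.
function buckToArabic(s) {
  return s.split('').map(function (c) { return buck2arab[c] || c; }).join('');
}
function arabicToBuck(s) {
  return s.split('').map(function (c) { return arab2buck[c] || c; }).join('');
}

// Strip tashkeel so a search ignores fatha/damma/kasra differences.
function stripVowels(buck) {
  return buck.replace(/[aiuoFNK~`]/g, '');
}
```

With the vowels stripped, a plain `indexOf` or regex over the whole Buck text matches any vocalization of a word, which is what produces the extra hits mentioned above.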

  • Corpus Grammar data:
    • Objective: Get grammatical data for each word.
    • Reliability: ** (2/5)  The Corpus is a work in progress, about 40% complete. It will have mistakes!
    • Demo: []  -- see last column.
      • [] lets you do live, local in-browser searches, similar to the website. You can download it & browse locally.
    • Advantages? The only source that has its raw data published.
    • Challenges: the grammatical data is NOT at the word level; it's at the word-segment level. Each word = n×PREFIX + 1 or 2 STEMs + n×SUFFIX.
    • Columns are not uniform, i.e. the number of feature fields varies per segment.
    • The demo relies on heavy-duty JavaScript regexes to cut out the info for each word & parse out the grammar info. You may want to do a one-time pre-mapping into your own format, but you'll have to redo it if the Corpus releases a new data file.
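A hedged sketch of the segment-level parsing described above, assuming the Corpus morphology file's tab-separated LOCATION / FORM / TAG / FEATURES layout (the function names and output shape are my own):

```javascript
// Parse one line of the Corpus morphology file, e.g.
// "(1:1:1:1)\tbi\tP\tPREFIX|bi+"  -- assumed tab-separated layout.
function parseSegment(line) {
  var cols = line.split('\t');
  var loc = cols[0].replace(/[()]/g, '').split(':').map(Number);
  var features = cols[3].split('|'); // variable length: columns aren't uniform
  return {
    chapter: loc[0], verse: loc[1], word: loc[2], segment: loc[3],
    form: cols[1],          // segment text in Buckwalter
    tag: cols[2],           // part-of-speech tag
    kind: features[0],      // PREFIX, STEM, or SUFFIX
    features: features.slice(1)
  };
}

// Re-group segments into whole words (word = prefixes + stem(s) + suffixes),
// since the raw data is per-segment, not per-word.
function groupByWord(segments) {
  var words = {};
  segments.forEach(function (seg) {
    var key = seg.chapter + ':' + seg.verse + ':' + seg.word;
    (words[key] = words[key] || []).push(seg);
  });
  return words;
}
```

A one-time pass like this replaces the per-query regex work, at the cost of re-running it whenever a new Corpus data file comes out.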

  • Transliteration data based on Corpus:
    • Not done. It should be a simple one-to-one mapping using []
    • Alternate approach: used a different data file that has the transliteration encoded in HTML format. Using jQuery and regex allows me to search by transliteration locally (no server call).
    • Demo: [] (type in 2:255)
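Since this dataset isn't done, here is only a hypothetical sketch of what the proposed one-to-one mapping could look like, going from Buckwalter to a reader-friendly transliteration. Every entry below is a placeholder; the real scheme would come from the mapping file linked above.

```javascript
// Hypothetical Buckwalter -> reader-friendly transliteration subset.
// The actual mapping table is NOT published yet; these pairs are examples.
var buck2translit = {
  '$': 'sh', 'x': 'kh', 'v': 'th', '*': 'dh',
  'a': 'a', 'i': 'i', 'u': 'u',
  'o': ''   // sukun carries no sound, so it drops out
};

function transliterate(buck) {
  return buck.split('').map(function (c) {
    return (c in buck2translit) ? buck2translit[c] : c;
  }).join('');
}
```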

Comments / Questions??
