Revision as of 12:43, 21 January 2012

GlobalQuran is a well-written ecosystem of UI pages, widgets, APIs, and search that anyone can use on their site.

This page is just to explore and evaluate new datasets that may or may not end up being used in GlobalQuran.

  •  [Word2Word data]
  •  [Buck data]
  •  [Corpus grammar data]
  •  [Transliteration data]


  • Word2Word data:
    • Objective: For each Quran word, get the English meaning. No compound-word meanings.
    • Reliability: *** (3/5), almost as good as the source
    • Demo: http://qurandev.github.com/ (see the 4th column)
    • Challenges: Because the data is machine-matched, it may not be 100% accurate. For compound words, most meanings are prefixed with **; these mark places where one meaning is duplicated across side-by-side words.
    • How was the data generated?
      • Started with the original data from http://allahsquran.com/learn/
      • Snipped out all meanings from the js files. Example: http://allahsquran.com/learn/ayas-s112d7q1f0o4.js
      • Identified all places where the number of words in Tanzil does not match the number of meaning segments.
      • Where there was a mismatch, either fixed it by hand (in a few cases) or prepended ** to the meaning and duplicated it.
    • Note: Apparently quran.com and corpus.quran.com also had to go through these word2word data cleanups.
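
The mismatch check and the ** duplication described above could look roughly like the sketch below. This is only an illustration: the record shape (`ref`, `words`, `meanings`), the function names, and the placeholder strings are assumptions, not the actual qurandev code, and it reads "duplicate the meaning after prepending **" as marking every copy.

```javascript
// Hypothetical record shape: Tanzil words and meaning segments for one aya.
// The w*/m* strings are placeholders, not real Quran data.
const demoAyas = [
  { ref: "demo:1", words: ["w1", "w2"], meanings: ["m1", "m2"] },
  { ref: "demo:2", words: ["w1", "w2", "w3"], meanings: ["m1", "m2+3"] },
];

// List every aya whose word count and meaning-segment count disagree.
function findMismatches(ayas) {
  return ayas
    .filter((aya) => aya.words.length !== aya.meanings.length)
    .map((aya) => aya.ref);
}

// The meaning at `index` covers `span` side-by-side words,
// so mark it with ** and repeat it once per covered word.
function expandCompound(meanings, index, span) {
  const marked = "**" + meanings[index];
  return [
    ...meanings.slice(0, index),
    ...Array(span).fill(marked),
    ...meanings.slice(index + 1),
  ];
}

console.log(findMismatches(demoAyas));
console.log(expandCompound(demoAyas[1].meanings, 1, 2));
```

Which meaning spans which words still has to be decided per aya (by hand or by some heuristic); the sketch only shows the bookkeeping once that is known.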


  • Buck data:
    • Objective: Instead of passing Tanzil Quran data as Unicode, pass it as ASCII, using a one-to-one mapping of Arabic Unicode characters to ASCII that can be mapped and remapped without loss of fidelity.
    • Reliability: **** (4/5), as good as Tanzil, the source
    • Demo: http://qurandev.github.com/ (see the 3rd column)
    • Advantages?
      • Buck uses less bandwidth: an Arabic letter takes two bytes in UTF-8, while its Buck equivalent takes one.
      • In JavaScript, you can search through the entire Buck Quran text in one shot, which is more intuitive than Arabic search.
      • Buck to Arabic and Arabic to Buck is a simple js call. Play with a live sample here: http://jsfiddle.net/BrxJP/
      • You can strip all vowels from the Buck text in a few milliseconds. Why do this? You can then search in JavaScript while ignoring tashkeel differences (fatha, damma, kasra), which leads to more hits.
      • Regex + Buck text can lead to awesome optimizations. All the searches can be run locally. Demo: http://qurandev.appspot.com/
    • How was the data generated? Just a one-to-one Buckwalter mapping using http://corpus.quran.com/java/buckwalter.jsp
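
As a rough illustration of the points above, here is a minimal sketch of a Buck/Arabic round trip plus a vowel-insensitive search. It covers only the core Buckwalter letters and diacritics (the corpus.quran.com scheme has extra Quranic symbols that are left out here), and the function names, the set of marks stripped, and the sample verse string are all assumptions made for the example rather than the code behind the demos.

```javascript
// Core Buckwalter table, built from contiguous Unicode ranges.
// Each string lists the ASCII symbols for consecutive code points.
const RANGES = [
  [0x0621, "'|>&<}AbptvjHxd*rzs$SDTZEg"], // hamza .. ghain
  [0x0640, "_fqklmnhwYyFNKaui~o"],        // tatweel, fa .. ya, tanween, vowels, shadda, sukun
  [0x0670, "`{"],                         // dagger alif, alif wasla
];

const buckToArabic = new Map();
const arabicToBuck = new Map();
for (const [start, symbols] of RANGES) {
  [...symbols].forEach((symbol, offset) => {
    const arabicChar = String.fromCharCode(start + offset);
    buckToArabic.set(symbol, arabicChar);
    arabicToBuck.set(arabicChar, symbol);
  });
}

// Characters outside the table (spaces, digits) pass through unchanged.
const convert = (text, table) =>
  [...text].map((ch) => table.get(ch) || ch).join("");
const toArabic = (buck) => convert(buck, buckToArabic);
const toBuck = (arabic) => convert(arabic, arabicToBuck);

// Drop short vowels, tanween, sukun and shadda (an assumed choice of marks).
const stripVowels = (buck) => buck.replace(/[aiuFNKo~]/g, "");

// Return the verses containing the query once vowel marks are ignored.
function searchIgnoringVowels(verses, query) {
  const needle = stripVowels(query);
  return verses.filter((verse) => stripVowels(verse).includes(needle));
}

// Illustrative sample in Buck form (assumed to match the Tanzil text for 1:1).
const verses = ["bisomi {ll~ahi {lr~aHoma`ni {lr~aHiymi"];
console.log(toArabic(verses[0]));
console.log(toBuck(toArabic(verses[0])) === verses[0]);
console.log(searchIgnoringVowels(verses, "bsm").length);
```

Because every symbol maps to exactly one code point and back, the round trip is lossless for any text that stays inside the table.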


  • Corpus grammar data:
    • Objective: Get grammatical data for each word.
    • Reliability: ** (2/5). Corpus is a work in progress, 40% complete. Will have mistakes!
    • Demo: http://qurandev.github.com/ (see the last column)
      • http://qurandev.github.com/Grammar.html lets you do local, live in-browser searches, similar to the corpus.quran.com website. You can download it and browse locally.
    • Advantages? It is the only source which has raw data published.
    • Challenges:
      • Grammatical data is NOT at word level; it's at word-segment level. Each word = n*PREFIX + 1 or 2 STEMS + n*SUFFIX.
      • Columns are not uniform. See https://raw.github.com/qurandev/qurandev/master/data/quranic-corpus-morphology-0.4.txt
      • The demo relies on heavy-duty JavaScript RegExes to get info for a word and to parse out grammar info. You may want to do a one-time premapping into your own format, but you will have to redo it if Corpus releases a new data file.
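
A one-time premapping from segment level to word level might be sketched as follows. It assumes the morphology file is tab-separated with a `(sura:aya:word:segment)` location followed by form, tag, and a `|`-separated feature list; the sample lines are modeled on that layout for illustration only, and the function name and output shape are hypothetical.

```javascript
// Illustrative lines in the assumed LOCATION / FORM / TAG / FEATURES layout.
const SAMPLE = [
  "LOCATION\tFORM\tTAG\tFEATURES",
  "(1:1:1:1)\tbi\tP\tPREFIX|bi+",
  "(1:1:1:2)\tsomi\tN\tSTEM|POS:N|LEM:{som|ROOT:smw|M|GEN",
  "(1:1:2:1)\t{ll~ahi\tPN\tSTEM|POS:PN|LEM:{ll~ah|ROOT:Alh|GEN",
].join("\n");

const SEGMENT_LINE = /^\((\d+):(\d+):(\d+):(\d+)\)\t([^\t]*)\t([^\t]*)\t(.*)$/;

// Collect all segments of a word under one "sura:aya:word" key.
function groupSegmentsByWord(text) {
  const words = new Map();
  for (const line of text.split(/\r?\n/)) {
    const match = SEGMENT_LINE.exec(line);
    if (!match) continue; // skip the header, comments and blank lines
    const [, sura, aya, word, , form, tag, features] = match;
    const key = `${sura}:${aya}:${word}`;
    if (!words.has(key)) words.set(key, { form: "", segments: [] });
    const entry = words.get(key);
    entry.form += form; // prefix + stem + suffix forms rebuild the whole word
    entry.segments.push({ form, tag, features: features.split("|") });
  }
  return words;
}

const byWord = groupSegmentsByWord(SAMPLE);
console.log(byWord.get("1:1:1"));
```

Keeping the feature list as an array, rather than forcing it into fixed columns, sidesteps the non-uniform columns; the grouped result can be saved once and regenerated whenever a new corpus file is released.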
  • [Transliteration data]
