School of Information Studies
Information Storage and Retrieval 783

INSTRUCTOR: Dr. Jin Zhang, Associate Professor
SCHOOL: School of Information Studies, UWM
CLASSROOM: Bolton 521
TIME: Thursday, 2:00 pm to 4:40 pm
SEMESTER: Fall 2009
OFFICE: Bolton 532
OFFICE HOURS: Wed, 1:00 pm to 2:00 pm
PHONE: 229-2712
EMAIL: [email protected]

1 Course Materials

Textbook:
[1] Information Storage and Retrieval by R. R. Korfhage, published by John Wiley & Sons in 1997. ISBN 0-471-14338-3.
[2] (Optional) Introduction to Information Retrieval by Manning, C.D., Raghavan, P. and Schütze, H., 2008, Cambridge University Press, ISBN-13: 9780521865715.
[3] (Optional) Information Storage and Retrieval Systems: Theory and Implementation, Second Edition. Gerald J. Kowalski, Mark T. Maybury, September 2000, Kluwer Academic Publisher, ISBN 0-7923-7924-1.

Other books:
[1] Text Information Retrieval Systems by Charles T. Meadow, published by Academic Press, Inc. in 2007, ISBN 0-12-369412-4. (Third Edition)
[2] Information Retrieval by C. J. van Rijsbergen, B.Sc., Dip. NAAC, Ph.D., M.B.C.S., F.I.E.E., C. Eng., F.R.S.E. The book is available at http://www.dcs.gla.ac.uk/Keith/Preface.html
[3] (Optional) Visualization for Information Retrieval by Zhang, J. 2008, Springer, ISBN: 978-3-540-75147-2

2 Course Description

This course on information storage and retrieval focuses on the theory and concepts of information retrieval systems and introduces the basic principles of information storage, processing, and retrieval in terms of information retrieval system analysis and design. Knowledge of, experience with, and background in information systems are preferred.

Pre-requisites: L&I Sci 571; or cons instr.

3 Course Credit

Graduate, 3 credits

4 Course Objectives

The aim of this course is to prepare students as information retrieval system analysts and designers. Generally speaking, information retrieval includes two different levels. The first is to effectively use information in an already existing information retrieval system; it is external and reflects the user's perspective. The second is to address how information is processed within an information retrieval system; it is internal. Information storage and retrieval focuses on the latter, the second level.

The topics in this course include query structure and its characteristics, document analysis, the representation of documents and other objects within an information system, data file structures, automatic indexing, internal matching mechanisms, retrieval effectiveness measure, output presentation, reference points, alternative retrieval techniques, visualization for information, the Internet search engine, comparison of different types of information retrieval systems and application of expert systems in information retrieval, and information system use, as well as a discussion of current research trends in the field.

The objectives are:
To outline basic terminology and components in information storage and retrieval systems
To compare and contrast information retrieval models and internal mechanisms such as Boolean, Probability, and Vector Space Models
To outline the structure of queries and documents
To articulate fundamental functions used in information retrieval such as automatic indexing, abstracting, and clustering
To critically evaluate information retrieval system effectiveness and improvement techniques
To understand the unique features of Internet-based information retrieval
To describe current trends in information retrieval such as information visualization

5 Course Grading

Grade   Points     Description
A       96-100     superior work
A-      91-95
B+      88-90
B       84-86      satisfactory, but undistinguished work
B-      80-83
C+      77-79
C       74-76      work is below standard
C-      70-73
D+      67-69
D       64-66      unsatisfactory work
D-      60-63
F       below 60

It is extremely important for you to understand the grading policies and obtain high points on your assignments. A significant portion of your grade is determined by your individual assignments. Assignments include, and are not limited to, presentations like relevance measure review.

Assignment                    Grade
Weekly Assignments (9)        45% (5% each)
Participation & Discussion    10%
Project                       45%

6 Attendance & Class Participation

Attendance is mandatory and class participation is expected. You will be graded on your participation and contributions to class discussions.

7 Course Schedule

Week 1 Introduction
Content: What is information retrieval, Significance of information retrieval and storage, Definition of information retrieval system, Objectives of information retrieval system, Components of information retrieval systems, Function overview, Abstraction, Logical organization, Physical organization, Data structure, Algorithm, Comparisons among different information systems, Relationships between Digital library and IRS, Measure of information systems, TREC, Research topics in IR.
Reading:
Chapter 1, Introduction to Information Retrieval by Manning
Lecture notes
N. J. Belkin and W. B. Croft. 1992. Information filtering and information retrieval: Two sides of the same coin? Communications of the ACM, 35(12):29-38.

Week 2 Vocabulary control and data compression
Content: type of documents, structural representation, document surrogates, vocabulary control, structure of a thesaurus, fine data structure, bit and byte, data compression, Huffman code technique.
Reading:
Chapter 2, Information Storage and Retrieval by Korfhage
Chapter 2, Introduction to Information Retrieval by Manning
Lecture notes

Week 3 Database file structures
Content: Information organization structures, types of data structure, sequential file, structure of a sequential file, inverted file, structure of an index file, hash approach, tree structure, N-gram data structure, signature file structure.
Reading:
Chapter 2, Information Storage and Retrieval Systems by Korfhage
Lecture notes

Week 4 Boolean retrieval systems
Content: Matching criteria, query, differences between documents and queries, Boolean logic, rules for operations, processing query expression: reverse Polish expression, limitations of Boolean logic.
Reading:
Chapters 3 and 4, Information Storage and Retrieval by Korfhage
Chapter 1, Introduction to Information Retrieval by Manning
Lecture notes

Week 5 Vector retrieval system
Content: Vector model, document-term matrix, methods for designing weights to terms, spatial representation of a document in vector model, query in the vector model, some considerations for the vector model.
Reading:
Chapters 3 and 4, Information Storage and Retrieval by Korfhage
Chapter 14, Introduction to Information Retrieval by Manning
Lecture notes

Week 6 Probability Retrieval System
Content: Basic concepts of probability, probability theory, statistical independence, Bayes theorem, assumptions of the probability model, representation of documents in the probability model, similarity between a query and a document (approach I), similarity between a query and a document (approach II), discrimination function, probability search.
Reading:
Chapters 3 and 4, Information Storage and Retrieval by Korfhage
Chapter 11, Introduction to Information Retrieval by Manning
Lecture notes
Crestani, Fabio, Mounia Lalmas, Cornelis J. Van Rijsbergen, and Iain Campbell. 1998. Is this document relevant? ... probably: A survey of probabilistic models in information retrieval. ACM Computing Surveys 30(4):528-552.
Fuhr, Norbert. 1992. Probabilistic models in information retrieval. Computer Journal 35(3):243-255.
S. E. Robertson and S. Walker. 1994. Some simple effective approximations to the 2-Poisson model for probabilistic weighted retrieval. In Proceedings of ACM SIGIR'94, pages 232-241.

Week 7 Automatic indexing and abstracting
Content: Indexing, purpose of indexing, automatic indexing, why use automatic indexing, stop list approach, raw term frequency approach, normalized term frequency approach, inverse term frequency approach, and other considerations.
Reading:
Chapter 4, Introduction to Information Retrieval by Manning
Zhang J., and Nguyen T (2005). A new term significance weighting approach. Journal of Intelligent Information Systems, 24(1), 61-85.
Robertson, S. (2004). Understanding inverse document frequency: On theoretical arguments for IDF. Journal of Documentation, 60(5), pp. 503-520.
Salton, G., Allan, J., & Singhal, A. (1996). Automatic text decomposition and structuring. Information Processing and Management, 32(2), pp. 127-138.
Cleverdon, Cyril W. 1991. The significance of the Cranfield tests on index languages. In Proc. SIGIR, pp. 3-12. ACM Press.

Week 8 Similarity measure algorithms
Content: Data fusion, general similarity measures, similarity measures in the vector retrieval model, comparisons of the two kinds of similarity approaches, term association, extended user profile, current awareness systems, retrospective search systems, modifying the query by the user profile, reference point.
Reading:
Chapters 4 and 5, Information Storage and Retrieval by Korfhage, and lecture notes
Zhang J., and Rasmussen E (2001). Developing a new similarity measure from two different perspectives. Information Processing & Management, 37(2), 279-294.
Zhang J., and Rasmussen E (2002). An experimental study on the iso-content-based angle similarity measure. Information Processing & Management, 38(3), 325-342.
Bartell, Brian T., Garrison W. Cottrell, and Richard K. Belew. 1998. Optimizing similarity using multi-query relevance feedback. JASIS 49(8):742-761.
Moffat, Alistair, and Justin Zobel. 1998. Exploring the similarity space. SIGIR Forum 32(1).
Griffiths, A., H. C. Luckhurst, and P. Willett. 1986. Using interdocument similarity in document retrieval systems. Journal of the American Society for Information Science, 37:3-11.

Week 9 Automatic clustering approaches
Content: Definition of automatic clustering, significance of a clustering approach in IR, differences between clustering and classification, criteria of clustering, categorization of clustering algorithms, hierarchical clustering algorithm, non-hierarchical clustering algorithm, the K-means clustering algorithm, hierarchy cluster in SPSS, K-means in SPSS.
Reading:
Chapters 16 and 17, Introduction to Information Retrieval by Manning
Lecture notes
Rasmussen, E. (1992). Clustering algorithms. In W. B. Frakes and R. Baeza-Yates (Eds.) Information retrieval: data structures & algorithms (pp. 419-442). Englewood Cliffs, NJ: Prentice Hall.
Jain, Anil K., M. Narasimha Murty, and Patrick Flynn. 1999. Data clustering: A review. ACM Computing Surveys 31(3):264-323.
Murtagh, Fionn. 1983. A survey of recent advances in hierarchical clustering algorithms. Computer Journal 26(4):354-359.
Hamerly, Greg, and Charles Elkan. 2003. Learning the k in k-means. In Proc. NIPS. URL: books.nips.cc/papers/files/nips16/NIPS2003_AA36.pdf.
Zhao, Y. and Karypis, G. (2002). Evaluation of hierarchical clustering algorithms for document databases. In Proceedings of the eleventh international conference on Information and knowledge management, November 04-09, 2002, McLean, Virginia, USA. pp. 515-524.

Week 10 Theory: Information Visualization
Content: Visualization, history of visualization, visualization for information retrieval, why use visualization for information retrieval, analysis of traditional information retrieval systems, Boolean-based information retrieval system, non-Boolean-based information retrieval system, navigation problems on WWW, functionality of visualization, core of visualization for information retrieval, consideration from cognitive engineering, technical environment for the visualization, visualization of web-based information, potential research topics.
Reading:
Chapter 1, Visualization for Information Retrieval by Zhang
Lecture notes
Card, S.K., Mackinlay, J.D., and Shneiderman, B. (1999). Readings in information visualization: using vision to think. San Francisco: Morgan Kaufmann. pp. 1-34.
Keim, D.A. (2001). Visual exploration of large data sets. Communications of the ACM, 44(8), pp. 38-44.
Hearst, M. (1999). User interfaces and visualization. In R. Baeza-Yates and B. Ribeiro-Neto, editors, Modern Information Retrieval, chapter 10, pp. 257-323. Addison Wesley, Harlow, 1999.

Week 11 Systems: Information Visualization
Content: Visualization systems: WebMap, LifeLines, DARE, JAIR INFORMATION SPACE, Visual thesaurus, VIBE, NiF Elastic Catalog, Tilebars, Web Brain, Dynamic Diagrams, Tree map, Excentric Labeling, SQWID, Inxight, Reveal things, Health InfoPark.
Reading:
Chapter 8, Visualization for Information Retrieval by Zhang
Lecture notes
Benford S, Snowdon D, Greenhalgh C, Ingram R, and Knox I (1995). VR-VIBE: A Virtual Environment for Co-operative Information Retrieval. Proceeding of Eurographics'95, Maastricht, August 30th-September 1st, 1995, pp. 349-360.
Chen C (1999). Visualising semantic spaces and author co-citation networks in digital libraries. Information Processing and Management, 35(3), 401-420.
Hearst MA (1995). TileBars: visualization of term distribution information in full text information access. Proceedings of the SIGCHI Conference on Human Factors in Computing Systems'95, Denver, Colorado, May 7-11, 1995, pp. 59-66.
Korfhage RR, and Olsen KA (1994). The role of visualization in document analysis. Proceedings of Third Annual Symposium on Document analysis and Information retrieval'94, Las Vegas, Nevada, April 11-13th, 1994, pp. 199-207.
Nuchprayoon, A., and Korfhage, R.R. (1997). GUIDO: visualizing document retrieval. Proceedings of the IEEE Information Visualization symposium'97, Isle Capri, Italy, September 23-26, 1997, pp. 184-188.
Zhang J, and Korfhage RR (1999). DARE: Distance and Angle Retrieval Environment: A Tale of the Two Measures. Journal of the American Society for Information Science, 50(9), 779-787.

Week 12 Internet Information Retrieval
Content: Challenge in the Web, jargons, language distribution, crawlers, crawling the Web, crawling approach, depth first approach, breadth first approach, centralized architecture, web page ranking, meta-search, considerations for meta-search engines, trends.
Reading:
Lecture notes
Chapters 9 and 10, Introduction to Information Retrieval by Manning
Brin, Sergey, and Lawrence Page. 1998. The anatomy of a large-scale hypertextual web search engine. In Proc. WWW, pp. 107-117.
Broder, Andrei. 2002. A taxonomy of web search. SIGIR Forum 36(2):3-10.
Gerrand, Peter. 2007. Estimating linguistic diversity on the internet: A taxonomy to avoid pitfalls and paradoxes. Journal of Computer-Mediated Communication 12(4). URL: jcmc.indiana.edu/vol12/issue4/gerrand.html.
Glover, Eric J., Kostas Tsioutsiouliklis, Steve Lawrence, David M. Pennock, and Gary W. Flake. 2002. Using web structure for classifying and describing web pages. In Proc. WWW, pp. 562-569. ACM Press.

Week 13 Evaluation issues
Content: Seven criteria for evaluation for information retrieval, relevance issue, possible factors which influence outcome of a search, quality versus quantity, average recall and average precision, harmonic mean, Kappa measure, evaluation of a search engine, Cranfield experimental study.
Reading:
Chapter 8, Information Storage and Retrieval by Korfhage
Chapter 8, Introduction to Information Retrieval by Manning
Lecture notes
Saracevic, T. (2007). Relevance: A review of the literature and a framework for thinking on the notion in information science. Part II: nature and manifestations of relevance. Journal of the American Society for Information Science and Technology, 58(13), 1915-1933.
Saracevic, T. (2007). Relevance: A review of the literature and a framework for thinking on the notion in information science. Part III: Behavior and effects of relevance. Journal of the American Society for Information Science and Technology, 58(13), 2126-2144.
Harter, Stephen P. 1998. Variations in relevance assessments and the measurement of retrieval effectiveness. JASIS 47:37-49.

Week 14 Student presentation

Note:
* If you are a student with special needs, please feel free to discuss them with the instructor.
* The schedule may be changed

Summary:
Week1,  Sept 3    Introduction
Week2,  Sept 10   Vocabulary control and data compression
Week3,  Sept 17   Database file structures
Week4,  Sept 24   Boolean retrieval systems
Week5,  Oct 1     Vector retrieval system
Week6,  Oct 8     Probability Retrieval System
Week7,  Oct 15    Automatic indexing and abstracting
Week8,  Oct 22    Similarity measure algorithms
Week9,  Oct 29    Automatic clustering approaches
Week10, Nov 5     Theory: Information Visualization
Week11, Nov 12    Systems: Information Visualization
Week12, Nov 19    Internet Information Retrieval
Week13, Nov 26    Thanksgiving recess, No class
Week14, Dec 3     Evaluation issues
Week15, Dec 10    Project presentation

Term paper topic list

Students will develop a 15-20 page paper on one of the topics listed below. Papers will characterize current issues associated with the topic, discuss the state of the art of the topic, evaluate sample systems, and outline future directions for the area. Papers must integrate a minimum of 15 relevant sources. Papers should use the American Psychological Association (APA) style (http://apastyle.apa.org/).

[1]. Music information retrieval
[2]. Image information organization and retrieval
[3]. Automatic indexing theory and practice
[4]. Evaluation of search engines
[5]. On Boolean-based information retrieval system
[6]. Comparison between Boolean-based and Vector-based information systems
[7]. Visualization of information: theoretical aspect
[8]. Visualization of information: system aspect
[9]. Automatic indexing/abstracting theory and practice
[10]. Other information retrieval models
[11]. Evaluation of an information visualization system

<SOIS and University Policies to be inserted>