Evaluation of Multimedia Retrieval Systems




Multimedia Indexing & Retrieval (12/6/14)

Outline
• Introduction
• Test Collections and Evaluation Workshops
• Evaluation Measures
• Significance Tests
• A Case Study: TRECVID

Introduction
• We discuss the tools and methodology for comparing the effectiveness of two or more multimedia retrieval systems in a meaningful way.
• Multimedia retrieval systems can be evaluated using two strategies:
  1. Without consulting the potential users of the system, e.g. query processing time, query throughput, etc.
  2. By consulting the potential users of the system, e.g. the effectiveness of the retrieved results.

Test Collections
• Doing an evaluation involving real people is not only costly, it is also difficult to control and hard to replicate.
• Methods have been developed to design so-called test collections; often these test collections are created by consulting potential users.
• A test collection can be used to evaluate multimedia retrieval systems without the need to consult the users during further evaluations.
• A new retrieval method or system can be evaluated by comparing it to some well-established methods in a controlled experiment.

Evaluation Methods in a Controlled Experiment (Hull, 1993)
• A test collection consisting of:
  1. Multimedia data
  2. A task and the data needed for that task, for instance image search using example images as queries
  3. Ground truth data (the correct answers)
• One or more suitable evaluation measures that assign values to the effectiveness of the search.
• A statistical methodology that determines whether the observed differences in performance between the methods investigated are statistically significant.

Glass Box vs Black Box
• Two approaches to testing the effectiveness of a multimedia search system:
  1. Glass box evaluation, i.e. the systematic assessment of every component of a system; example: the word error rate of the speech recognition or shot detection subsystem of a video retrieval system.
  2. Black box evaluation, i.e. testing the system as a whole.

Test Collections and Evaluation Workshops
• The scientific evaluation of multimedia search systems is a costly and laborious job.
• Big companies, for instance Web search companies such as Yahoo and Google, might perform such evaluations themselves.
• For smaller companies and research institutions, a good option is often to take a test collection that has been prepared by others.
• Test collections are often created by the collaborative effort of many research groups in so-called evaluation workshops of conferences, e.g. the Text Retrieval Conference (TREC).

TREC-style Evaluation Cycle
[Figure: the TREC-style evaluation cycle]
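The slides contain no code, but the pieces of a controlled experiment listed above (a collection, topics, ground truth, an evaluation measure, and an aggregation step) can be sketched in a few lines of Python. Everything in the sketch is an invented illustration: the documents, topics, qrels and the toy_system ranking function are placeholders, and precision at rank 2 stands in for whichever measure a real experiment would use.

```python
# Minimal sketch of a TREC-style controlled experiment (hypothetical data and system).
from typing import Dict, List, Set

# 1. Test collection: multimedia objects, query topics, and ground truth (qrels).
documents: Dict[str, str] = {"img1": "...", "img2": "...", "img3": "..."}
topics: Dict[str, str] = {"T1": "sunset over water", "T2": "crowd at a sports event"}
qrels: Dict[str, Set[str]] = {"T1": {"img2"}, "T2": {"img1", "img3"}}

# 2. The system under test: any function that ranks document ids for a topic.
def toy_system(topic: str) -> List[str]:
    return list(documents)  # placeholder ranking; a real system ranks by similarity

# 3. An evaluation measure: precision of the top-k results.
def precision_at_k(ranking: List[str], relevant: Set[str], k: int = 2) -> float:
    top = ranking[:k]
    return sum(1 for d in top if d in relevant) / k

# 4. The evaluation cycle: run every topic, score against the ground truth, aggregate.
scores = [precision_at_k(toy_system(t), qrels[t]) for t in topics]
print("mean P@2:", sum(scores) / len(scores))
```

In a real evaluation the aggregation step would also feed the per-topic scores of two or more systems into a significance test, as discussed later in these slides.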
Ground Truth Data Preparation
• For black-box evaluations, people are involved in the process of deciding whether a multimedia object is relevant for a certain query.
• For glass-box evaluations, for instance the evaluation of a shot boundary detection algorithm, the process of deciding whether a sequence of frames contains a shot boundary involves people as well.

Corel Set
• The Corel test collection is a set of stock photographs, divided into subsets of images that each share a theme (e.g. sunsets, Arabian horses, English pub signs, etc.).
• The collection has been used to evaluate the effectiveness of content-based image retrieval systems in a large number of publications and has become a de-facto standard in the field.

Corel Set (cont.)
• Usually, Corel is used as follows (a code sketch of this protocol is given at the end of this subsection):
  1. From the test collection, take out one image and use it as a query-by-example;
  2. Relevant images for the query are those that come from the same theme; and
  3. Compute precision and recall for a number of queries.
• Problems:
  - A single "Corel set" does not exist; different publications use different CDs and different selections of themes.
  - The collection is too simple, since the images of a theme come from a single professional photographer and do not represent realistic settings.

MIREX Workshops
• The Music Information Retrieval Evaluation eXchange (MIREX) test collections consist of data from record labels that allow publication of tracks from their artists' releases.
• MIREX uses a glass box approach to the evaluation of music information retrieval by identifying various subtasks, e.g. genre classification, symbolic genre classification, tempo extraction, melody extraction, symbolic melodic similarity, key finding, drum detection, and onset detection.

TRECVID: The TREC Video Retrieval Workshops
• The workshop emerged as a special task of the Text Retrieval Conference (TREC) in 2001 and later continued as an independent workshop collocated with TREC.
• The workshop has focused mostly on news videos from US, Chinese and Arabic sources, e.g. CNN Headline News and ABC World News Tonight.

TRECVID Tasks (example)
• TRECVID provides several glass-box (white-box) evaluation tasks, such as:
  - Shot boundary detection
  - Story segmentation
  - High-level feature extraction
  - Etc.
• The workshop series also provides a black-box evaluation framework: a search task that may include the complete interactive session of a user with the system.

CLEF Workshops
• CLEF (the Cross-Language Evaluation Forum) is a workshop series mainly focusing on multilingual retrieval, i.e. retrieving data from a collection of documents in many languages, and on cross-language retrieval.
• CLEF started as a cross-language retrieval task in TREC in 1997 and became independent from TREC in 2000.

ImageCLEF
• Goal: to investigate the effectiveness of combining text and images for retrieval.
• ImageCLEF also provides an evaluation task for a medical image collection consisting of medical photographs, X-rays, CT scans, MRIs, etc., provided by the Geneva University Hospital in the Casimage project.
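Looking back at the Corel protocol above (leave one image out, use it as a query-by-example, count same-theme images as relevant), a minimal sketch could look as follows. The theme labels, image ids and the rank_by_similarity function are hypothetical placeholders, not part of any actual Corel release.

```python
# Hypothetical sketch of the leave-one-out Corel evaluation protocol.
# "collection" maps an image id to its theme; rank_by_similarity is a stand-in
# for the content-based image retrieval system being evaluated.
from typing import Dict, List

collection: Dict[str, str] = {
    "im001": "sunsets", "im002": "sunsets", "im003": "sunsets",
    "im101": "arabian_horses", "im102": "arabian_horses",
    "im201": "english_pub_signs", "im202": "english_pub_signs",
}

def rank_by_similarity(query_id: str, candidates: List[str]) -> List[str]:
    return sorted(candidates)  # placeholder: a real system sorts by visual similarity

precisions, recalls, k = [], [], 5
for query_id, theme in collection.items():
    candidates = [i for i in collection if i != query_id]   # leave the query image out
    ranking = rank_by_similarity(query_id, candidates)[:k]  # top-k retrieved images
    relevant = {i for i in candidates if collection[i] == theme}  # same theme = relevant
    hits = sum(1 for i in ranking if i in relevant)
    precisions.append(hits / k)
    recalls.append(hits / len(relevant) if relevant else 0.0)

print("mean precision@%d: %.3f" % (k, sum(precisions) / len(precisions)))
print("mean recall@%d:    %.3f" % (k, sum(recalls) / len(recalls)))
```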
INEX Workshops
• The Initiative for the Evaluation of XML Retrieval (INEX) aims at evaluating the retrieval performance of XML retrieval systems.
• It focuses on using the structure of the document to extract, relate and combine the relevance scores of different multimedia fragments.

INEX Workshop Task (example)
• INEX 2006 had a modest multimedia task that involved data from Wikipedia.
• A search query might, for instance, search for images by combining information from the image's caption text, its links, and the article that contains the image.

Evaluation Measures
• The effectiveness of a system or component is often measured by the combination of precision and recall.
• Precision is the fraction of the retrieved (or detected) objects that are actually relevant; recall is the fraction of the relevant objects that are actually retrieved:
  recall = (number of relevant documents retrieved) / (total number of relevant documents)
  precision = (number of relevant documents retrieved) / (total number of documents retrieved)
[Figure: retrieved documents and relevant documents as overlapping subsets of the entire document collection]

Determining Recall is Difficult
• The total number of relevant items is sometimes not available. Two ways to approximate it:
  - Sample across the database and perform relevance judgments on the sampled items.
  - Apply different retrieval algorithms to the same database for the same query; the aggregate of the relevant items found is taken as the total relevant set.

Trade-off between Recall and Precision
[Figure: precision (vertical axis, 0 to 1) versus recall (horizontal axis, 0 to 1). The ideal is the top-right corner; the top-left returns relevant documents but misses many useful ones, while the bottom-right returns most relevant documents but includes lots of junk.]

Computing Recall/Precision Points
• For a given query, produce the ranked list of retrieved documents.
• Adjusting a threshold on this ranked list produces different sets of retrieved documents, and therefore different recall/precision measures.
• Mark each document in the ranked list that is relevant according to the gold standard.
• Compute a recall/precision pair for each position in the ranked list that contains a relevant document.

Computing Recall/Precision Points (example)
   n   doc #   relevant
   1   588     x
   2   589     x
   3   576
   4   590     x
   5   986
   6   592     x
   7   984
   8   988
   9   578
  10   985
  11   103
  12   591
  13   772     x
  14   990
• Let the total number of relevant documents be 6. Check each new recall point:
  R=1/6=0.167, P=1/1=1
  R=2/6=0.333, P=2/2=1
  R=3/6=0.5,   P=3/4=0.75
  R=4/6=0.667, P=4/6=0.667
  R=5/6=0.833, P=5/13=0.38
• One relevant document is missing from the list, so 100% recall is never reached.

Precision at Fixed Recall Levels
• A number of fixed recall levels are chosen, for instance 10 levels: {0.1, 0.2, ..., 1.0}.
• The levels correspond to users that are satisfied if they find respectively 10%, 20%, ..., 100% of the relevant documents.
• For each of these levels, the corresponding precision is determined by averaging the precision at that level over the queries of the tests that are performed.
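The recall/precision points of the worked example above, and the precision at the ten fixed recall levels, can be computed directly from the ranks of the relevant documents. The sketch below uses the common convention of interpolating the precision at a recall level as the maximum precision observed at that level or beyond; the slides only state that the per-level precision is averaged over queries, so the interpolation step is an assumption of this sketch.

```python
# Recall/precision points for the worked example above: a ranked list of 14
# documents, 6 relevant in total, relevant documents at ranks 1, 2, 4, 6 and 13.
relevant_ranks = {1, 2, 4, 6, 13}
total_relevant = 6

points = []          # (recall, precision) at each rank holding a relevant document
hits = 0
for rank in range(1, 15):
    if rank in relevant_ranks:
        hits += 1
        points.append((hits / total_relevant, hits / rank))

for r, p in points:
    print(f"R={r:.3f}  P={p:.3f}")   # 0.167/1.0, 0.333/1.0, 0.5/0.75, 0.667/0.667, 0.833/0.385

# Precision at the 10 fixed recall levels 0.1 ... 1.0, interpolated as the maximum
# precision at any recall greater than or equal to the level (0 if never reached).
levels = [round(0.1 * i, 1) for i in range(1, 11)]
interpolated = {
    level: max((p for r, p in points if r >= level), default=0.0) for level in levels
}
print(interpolated)
```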
Precision at Fixed Points in the Ranked List
• Recall is not necessarily a good measure of user equivalence: if one query has 20 relevant documents while another has 200, a recall of 50% would be a reasonable goal in the first case but unmanageable for most users in the second case.
• Alternative: choose a number of fixed points in the ranked list, for instance nine points at 5, 10, 15, 20, 30, 100, 200, 500 and 1000 documents retrieved.
• These points correspond to users that are willing to read 5, 10, 15, etc. documents of a search.
• For each of these points in the ranked list, the precision is determined by averaging the precision at that point over the queries.

Mean Average Precision
• Average precision: the average of the precision values at the points at which each relevant document is retrieved, counting a precision of 0 for relevant documents that are never retrieved.
  Example (for the ranked list above): (1 + 1 + 0.75 + 0.667 + 0.38 + 0) / 6 = 0.633
• Mean average precision (MAP): the average of the average precision values over a set of queries.

Combining Precision/Recall: F-Measure
• One measure of performance that takes both recall and precision into account.
• Harmonic mean of recall and precision:
  F = 2PR / (P + R) = 2 / (1/R + 1/P)
• Compared to the arithmetic mean, both precision and recall need to be high for the harmonic mean to be high.

Significance Tests
• Simply citing percentage improvements of one method over another is helpful, but it does not tell whether the improvements were in fact due to differences between the two methods.
• A difference between two methods might simply be due to random variation in performance.
• To make significance testing of the differences applicable, a reasonable number of queries is needed.
• A technique called cross-validation is used to further prevent biased evaluation results.

Cross Validation (example)
[Figure: cross-validation example]

Significance Test Assumptions
• Significance tests are designed to disprove the null hypothesis H0.
• For retrieval experiments, the null hypothesis is that there is no difference between method A and method B.
• Rejecting H0 implies accepting the alternative hypothesis H1, which here is that either:
  1. Method A consistently outperforms method B, or
  2. Method B consistently outperforms method A.

Common Significance Tests
1. The paired t-test: assumes that the errors are normally distributed.
2. The paired Wilcoxon signed ranks test: a non-parametric test that assumes that the errors come from a continuous distribution that is symmetric around 0.
3. The paired sign test: a non-parametric test that only uses the sign of the differences between method A and method B for each query; the test statistic is the number of times that the least frequent sign occurs.
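As an illustration of the three paired tests above, the sketch below runs them on invented per-query average precision scores for two systems using SciPy (the sign test uses scipy.stats.binomtest, which assumes SciPy 1.7 or newer). The numbers are made up for the example and carry no meaning.

```python
# Illustration of the three paired tests above on invented per-query average
# precision scores for two systems (10 queries); requires SciPy >= 1.7.
from scipy import stats

ap_a = [0.21, 0.35, 0.10, 0.44, 0.28, 0.19, 0.52, 0.33, 0.25, 0.30]
ap_b = [0.27, 0.38, 0.12, 0.41, 0.35, 0.22, 0.58, 0.37, 0.29, 0.36]
diffs = [b - a for a, b in zip(ap_a, ap_b)]

# 1. Paired t-test: assumes normally distributed errors.
t_res = stats.ttest_rel(ap_a, ap_b)
print("paired t-test        p =", t_res.pvalue)

# 2. Paired Wilcoxon signed ranks test: symmetric continuous error distribution.
w_res = stats.wilcoxon(ap_a, ap_b)
print("Wilcoxon signed rank p =", w_res.pvalue)

# 3. Paired sign test: the count of the least frequent sign is tested against a
#    binomial(n, 0.5) distribution (ties, i.e. zero differences, are dropped).
pos = sum(d > 0 for d in diffs)
neg = sum(d < 0 for d in diffs)
s_res = stats.binomtest(min(pos, neg), n=pos + neg, p=0.5)
print("paired sign test     p =", s_res.pvalue)
```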
Significance Test (example)
• We compare the effectiveness of two content-based image retrieval systems A and B.
• Given an example query image, each system returns a ranked list of images from a standard benchmark test collection.
• For each system, we run 50 queries from the benchmark.
• Based on the benchmark's relevance judgments, the mean average precision is 0.208 for system A and 0.241 for system B.
• Is this difference significant?

A Case Study: TRECVID
• In 2004, there were four main tasks in TRECVID:
  Glass box tasks:
  1. Shot boundary detection
  2. Story segmentation
  3. High-level feature extraction
  Black box task:
  4. Search

Shot Boundary Detection
• A shot is the complete sequence of frames starting from the moment the camera was switched on and stopping when it was switched off again, or a subsequence of that selected by the movie editor.
• Shot boundaries are the positions in the video stream that contain the transitions from one shot to the next.
• There are two kinds of boundaries:
  1. Hard cuts: a relatively large difference between two consecutive frames; easy to detect.
  2. Soft cuts: a sequence of frames that belongs to both the first shot and the second shot (e.g. a dissolve); harder to detect.

TRECVID 2004 Shot Boundary Detection Task
• Goal: to identify the shot boundaries with their location and type (hard or soft) in the given video clips.
• The performance of a system is measured by the precision and recall of the detected shot boundaries.
• The detection criteria require only a single frame of overlap between a submitted transition and the reference transition (a sketch of this matching criterion is given after the figure below).
• The TRECVID 2004 test data:
  - 618,409 frames
  - 4,806 shot transitions (58% hard cuts, 42% soft cuts, mostly dissolves)
• The best performing systems reached about 95% precision and recall on hard cuts and about 80% precision and recall on soft cuts.

Hard Cuts vs Soft Cuts
[Figure: example frame sequences of a hard cut and of a soft cut with a dissolve effect]
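A minimal sketch of the single-frame-overlap criterion used in the shot boundary task above: a submitted transition counts as correct if it shares at least one frame with an as yet unmatched reference transition, and precision and recall are then computed over the matches. The frame intervals and the greedy one-to-one matching are simplifying assumptions, not the official TRECVID evaluation code.

```python
# Sketch of shot boundary evaluation with single-frame overlap matching.
# Intervals are (first_frame, last_frame) of a transition.
reference = [(120, 121), (480, 510), (902, 903)]   # ground-truth transitions
submitted = [(119, 120), (500, 530), (700, 701)]   # system output

def overlaps(a, b):
    return a[0] <= b[1] and b[0] <= a[1]            # at least one shared frame

matched_refs = set()
true_positives = 0
for sub in submitted:
    for i, ref in enumerate(reference):
        if i not in matched_refs and overlaps(sub, ref):
            matched_refs.add(i)
            true_positives += 1
            break

precision = true_positives / len(submitted)
recall = true_positives / len(reference)
print(f"precision={precision:.2f} recall={recall:.2f}")   # 0.67 and 0.67 here
```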
Story Segmentation
• A digital video retrieval system with shots as the basic retrieval unit might not be desirable for news shows and news magazines.
• The segmentation of a news show into its constituent news items has been studied under the names story boundary detection and story segmentation.
• Story segmentation of multimedia data can potentially exploit the visual, audio and textual cues present in the data.
• Story boundaries often, but not always, occur at shot boundaries.

TRECVID 2004 Story Segmentation Task
• Goal: given the story boundary test collection, identify the story boundaries with their location (time) in the given video clip(s).
• The definition of the story segmentation task was based on manual story boundary annotations made by the Linguistic Data Consortium (LDC) for the Topic Detection and Tracking (TDT) project.
• Experimental setups:
  - Video + Audio (no ASR/CC)
  - Video + Audio + ASR/CC
  - ASR/CC only (no Video + Audio)

Story Segmentation (Zhai, 2005)
[Figure: example from Zhai (2005). The video consists of two news stories and one commercial; the blue circle and the red circle are the news stories, and the green circle represents a miscellaneous story.]

Story Boundary Evaluation
• A story boundary was expressed as a time offset with respect to the start of the video file in seconds, accurate to the nearest hundredth of a second.
• Each reference boundary was expanded with a fuzziness factor of five seconds in each direction.
• If a computed boundary did not fall in the evaluation interval of a reference boundary, it was considered a false alarm (a code sketch of this protocol is given at the end of these notes).

High-level Feature Extraction
• One way to bridge the semantic gap between video content and textual representations is to annotate video footage with high-level features.
• The assumption is that features describing generic concepts can support retrieval, e.g.:
  - Genres: commercial, politics, sports, weather, financial
  - Objects: face, aircraft, boat, car, building
  - Scenes: water, sky, indoor, outdoor, mountain, fire, explosion, map, scape, vegetation
  - Actions: people walking, people in a crowd

Video Search Task
• Goal: to evaluate the system as a whole.
• Given the search test collection and a multimedia statement of a user's information need (called a topic in TRECVID), return a ranked list of at most 1000 shots from the test collection that best satisfy the need.
• A topic consists of a textual description of the user's need, for instance: "Find shots of one or more buildings with flood waters around it/them.", and possibly one or more example frames or images.

Manual vs Interactive Search Task
[Figure: comparison of the manual and interactive search task setups]
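To close, here is a small sketch of the story boundary evaluation protocol described earlier: reference boundaries are expanded by the five-second fuzziness factor, computed boundaries falling outside every expanded interval are counted as false alarms, and boundary recall and precision follow from the matches. The boundary times are invented, and matching each reference boundary at most once is an assumption of this sketch rather than something stated on the slides.

```python
# Sketch of story boundary evaluation with a +/- 5 second fuzziness factor.
FUZZ = 5.0
reference = [12.40, 95.00, 310.25]          # annotated story boundaries (seconds)
computed = [10.10, 97.80, 150.00, 309.00]   # boundaries produced by a system

intervals = [(r - FUZZ, r + FUZZ) for r in reference]  # expanded evaluation intervals
matched = set()
hits = 0
for c in computed:
    for i, (lo, hi) in enumerate(intervals):
        if i not in matched and lo <= c <= hi:
            matched.add(i)
            hits += 1
            break

false_alarms = len(computed) - hits          # computed boundaries matching no interval
recall = hits / len(reference)
precision = hits / len(computed)
print(f"hits={hits} false_alarms={false_alarms} "
      f"recall={recall:.2f} precision={precision:.2f}")
```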