Department of Computer Science and Information Technology
Institute of Advanced Studies in Basic Sciences

Text Mining and Web Mining - Information Retrieval (Spring 2019)

First Instructor: Dr. Ebrahim Ansari
Office Hours: See my weekly Schedule
Location: Computer Science and Information Technology Dept., Room 219

Second Instructor: Dr. Mehdi Bohlouli
Location: Computer Science and Information Technology Dept., Room 215


Teaching Assistants
Name                   E-mail Address            Role
Hadi Abdi Khojasteh    hkhojasteh@iasbs.ac.ir    Proctor
Mohammad Mahmoudi      m.mahmodi@iasbs.ac.ir     Proctor
Hamid Haghdoust        hamid.h@iasbs.ac.ir       Proctor

Required Text: Introduction to Information Retrieval
Authors: C. Manning, P. Raghavan, and H. Schütze

Supplementary Material 01: Natural Language Processing (Almost) from Scratch
Authors: Ronan Collobert, Jason Weston, Leon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel Kuksa

Supplementary Material 02: Natural Language Processing with Python
Authors: Steven Bird, Ewan Klein, and Edward Loper

Supplementary Material 03: A Revealing Introduction to Hidden Markov Models
Author: Mark Stamp
You may see the pdf file here

Other Supplementary Materials: Scientific papers which will be announced during the course.


Comments:

This course presents advanced topics in text and web mining, including information retrieval systems. The main prerequisite is the ability to process and pre-process text in a programming language such as Python. To that end, reading the first chapters of the book "Natural Language Processing with Python" is very useful. The online version is available here: "Natural Language Processing with Python"

We will complete this page gradually.


Schedule:

Homework 01: Python preliminaries
Due date: Sat, 13/04, 23:59 (GMT+4:30)
Total points: 1
Delivery: Send your Python files, all attached to a single email, to Mr. Hamid Haghdoust before the deadline.
Description: After reading the first chapter of the book "Natural Language Processing with Python", which is available here, answer the following questions (text1 ... text9 and sent1 ... sent9 are introduced in Chapter 1):
  1. Write a program to find the collocations in text5.
  2. Write the slice expression that extracts the last two words of text2.
  3. Write a program to find all the four-letter words in the Chat Corpus (text5). With the help of a frequency distribution (FreqDist), show these words in decreasing order of frequency.
  4. Define sent to be the list of words ['she', 'sells', 'sea', 'shells', 'by', 'the', 'sea', 'shore']. Now write code to perform the following tasks: (a) Print all words beginning with sh. (b) Print all words longer than four characters.
  5. Using list addition, and the set and sorted operations, write a program to compute the vocabulary of the sentences sent1 ... sent8.
  6. Define a function called vocab_size(text) that has a single parameter for the text, and which returns the vocabulary size of the text.
  7. Define a function percent(word, text) that calculates how often a given word occurs in a text, and expresses the result as a percentage.
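
The list and counting operations these questions rely on can be sketched on a plain Python list. This is only an illustration, not a model answer: the list `sent` stands in for the NLTK texts, `collections.Counter` is used in place of `FreqDist`, and the variable names are hypothetical. The sketch assumes the NLTK text objects support the same list-like operations (`len`, slicing, `count`).

```python
from collections import Counter

# Hypothetical stand-in for an NLTK text: a plain list of word tokens.
sent = ['she', 'sells', 'sea', 'shells', 'by', 'the', 'sea', 'shore']


def vocab_size(text):
    """Return the number of distinct tokens in the text."""
    return len(set(text))


def percent(word, text):
    """Return how often `word` occurs in `text`, as a percentage of all tokens."""
    return 100 * text.count(word) / len(text)


# Slice expression: the last two tokens.
last_two = sent[-2:]

# List comprehensions filter tokens by a condition.
starts_with_sh = [w for w in sent if w.startswith('sh')]
longer_than_four = [w for w in sent if len(w) > 4]

# Counter plays the role of FreqDist here (an assumption of this sketch);
# most_common() lists (token, count) pairs in decreasing order of count.
three_letter_counts = Counter(w for w in sent if len(w) == 3).most_common()

print(vocab_size(sent), percent('sea', sent), last_two)
print(starts_with_sh, longer_than_four, three_letter_counts)
```

With NLTK installed, the same expressions should carry over to text1 ... text9 once they are imported as described in Chapter 1.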


TA session 01: Sun, 07/04/2019, 11:00-12:30
Lecturer: Mr. Hadi Abdi Khojasteh or Mr. Mohammad Mahmoudi
Description: The lecturer will review the first chapter of the book "Natural Language Processing with Python" and answer your questions about it.

TA session 02: Tue, 09/04/2019, 11:00-12:30 (only if required; to be decided based on the first TA session)
Lecturer: Mr. Hadi Abdi Khojasteh
Description: He will continue teaching the concepts of the first session.

Homework 02: PageRank algorithm
Due date: Fri, 19/04, 23:59 (GMT+4:30)
Total points: 1
Description: First read the following documents:
PageRank description prepared by Kenneth Shum for "ENGG2012B Advanced Engineering Mathematics"
Wikipedia page for PageRank

Now implement the following two versions of PageRank using Python, C, C++, or another general-purpose programming language of your choice. You must prepare a report (in Persian or English) that describes both of your implementations. The report should contain the problem statement, the algorithm, your implementation(s), the evaluation, and a conclusion.
  1. Implement a simple PageRank algorithm described in Section 1 of the introduced document (PageRank description).
  2. Implement the more advanced PageRank algorithm which takes into account the dangling nodes (Section 2 of the document).
Note: You must create and initialize a directed graph that represents your web pages. Following the document above, this graph can be represented by an N×N matrix whose entries are determined randomly. You may also use any open-source graph creation library.
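
As a rough orientation for the second task, the sketch below shows one common way to run PageRank by power iteration while handling dangling nodes. It is not taken from the referenced document, whose formulation may differ: the damping factor of 0.85, the uniform redistribution of the rank held by dangling pages, the adjacency-list input, and the small four-page graph are all assumptions made for illustration.

```python
def pagerank(links, damping=0.85, tol=1e-10, max_iter=1000):
    """Approximate PageRank by power iteration.

    links[i] is the list of pages that page i links to.
    The damping value and the dangling-node rule are assumptions of this sketch.
    """
    n = len(links)
    rank = [1.0 / n] * n  # start from the uniform distribution
    for _ in range(max_iter):
        # Rank currently held by dangling pages (no out-links) is spread
        # uniformly over all pages, so the total rank stays equal to 1.
        dangling = sum(rank[i] for i in range(n) if not links[i])
        new_rank = [(1.0 - damping) / n + damping * dangling / n] * n
        for i, targets in enumerate(links):
            if targets:
                # Each page passes an equal share of its rank to every target.
                share = damping * rank[i] / len(targets)
                for j in targets:
                    new_rank[j] += share
        change = sum(abs(a - b) for a, b in zip(new_rank, rank))
        rank = new_rank
        if change < tol:
            break
    return rank


# Hypothetical 4-page web: page 3 has no out-links (a dangling node).
links = [[1, 2], [2], [0], []]
ranks = pagerank(links)
print(ranks)
```

For the assignment itself, the graph would instead come from a randomly initialized N×N matrix or a graph library, and the report should state which convention for dangling nodes was followed.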


Delivery: Send your code and the accompanying report, all attached to a single email, to Mr. Mohammad Mahmoudi before the deadline.

Quiz 01: Python preliminaries
Time: Sun, 14/04, 11:00 (GMT+4:30)
Total points: 1

Class session 01: Introduction and Preliminaries
Time: Fri, 19.04, 11:00-12:30
Lecturer: Dr. Ebrahim Ansari

Class session 02: Introduction and Preliminaries
Time: Fri, 19.04, 13:30-15:00
Lecturer: Dr. Ebrahim Ansari

Class session 03: Dictionary
Time: Fri, 19.04, 15:30-17:00
Lecturer: Dr. Ebrahim Ansari

Homework 03:
Due date: Sat, 27/04, 14:00 (GMT+4:30)
Total points: 1
Description:
1) Write your answers to the following exercises and questions:
- Page 33 of main slides, lecture1-intro.ppt
- Page 38 of main slides, lecture1-intro.ppt
- Page 40 of main slides, lecture1-intro.ppt
- Run two existing versions of an English-language stemmer (e.g. Porter) and write a report that explains how to run them, compares them, and covers any other interesting issues.
Delivery: Please send your answers to Mr. Hamid Haghdoust.

Class session 04: Tolerant Retrieval
Time: Sat, 20.04, 17:00-18:30
Lecturer: Dr. Ebrahim Ansari

Class session 05: Compression
Time: Sun, 21.04, 10:00-11:30
Lecturer: Dr. Ebrahim Ansari

Class session 06: TF-IDF
Time: Sun, 21.04, 12:00-13:30
Lecturer: Dr. Ebrahim Ansari

Class session 07: TF-IDF
Time: Sun, 21.04, 14:00-15:30
Lecturer: Dr. Ebrahim Ansari

Class session 08: Vector Space
Time: Sun, 21.04, 15:00-16:30
OR Time: Mon, 22.04, 12:30-14:30
Lecturer: Dr. Ebrahim Ansari

Homework 04: Text Preprocessing
Due date: Sun, 12/05, 23:59 (GMT+4:30)
Total points: 2
Description: Given in class
Delivery: TBD

Class session 09: Evaluation
Time: Tue, 23.04, 11:00-13:00
Lecturer: Dr. Ebrahim Ansari

Homework 05: Query Document - Project
Due date: Sun, 12/05, 23:59 (GMT+4:30)
Total points: 2
Description: Given in class
queries needed for this project
relevance judgements needed for this project
Delivery: TBD

Homework 06: Paper reading/presentation
Due date: Fri, 31/05, 23:59 (GMT+4:30)
Total points: 4
Description: Each student should read and understand the assigned paper, prepare a two- to three-page document about it, and may present the paper to me during the semester. As discussed in class, this homework must be done individually.
Behrouznia and Salehi: word2vec and fastText
Hosseini and Heidari: behavioural study
Zakariapour and Rabioun: bipartite
Farhadi and Farahmandian: query suggestion
Yousefi and Vafayi: retrieval heuristics
Tahmasbi and Sabzi: MapReduce
Mehri and Mohammadi: approximation
SadreJahani and L. Moghaddam: inverted index
Alamdar and Azizi: compression
Nazeri and Mirzayi: Twitter
Fallahi and A. Moghaddam: word2vec and Morfessor

Delivery: TBD

TA session 03: Sun, 28/04/2019, 11:00-12:30
Lecturer: Mr. Hamid Haghdoust
Description: He will speak about the last homework.

Class session 10: Introduction to NLP
Time: Sun, 26.05, 11:00-12:30
Lecturer: Dr. Mehdi Bohlouli

Class session 11: Introduction to NLP
Time: Sun, 26.05, 17:00-18:30
Lecturer: Dr. Mehdi Bohlouli

Class session 12: Language models
Time: Tue, 28.05, 11:00-12:30
Lecturer: Dr. Mehdi Bohlouli

Class session 13: Language models
Time: Tue, 28.05, 17:00-18:30
Lecturer: Dr. Mehdi Bohlouli

Class session 14: Text classification
Time: Wed, 29.05, 17:00-18:30
Lecturer: Dr. Mehdi Bohlouli

Class session 15: Sentiment Analysis
Time: Sat, 01.06, 17:00-18:30
Lecturer: Dr. Mehdi Bohlouli

Class session 16: POS Tagging
Time: Sun, 02.06, 11:00-12:30
Lecturer: Dr. Mehdi Bohlouli

Class session 17: Semantics
Time: Sun, 02.06, 17:00-18:30
Lecturer: Dr. Mehdi Bohlouli

Class session 18: POS Tagging
Time: Mon, 03.06, TBD
Lecturer: Dr. Mehdi Bohlouli

Class session 19: Paper presentation
Time: TBD
Lecturer: TBD

Homework 07: Natural Language Processing tasks
Due date: Fri, 07/06, 23:59 (GMT+4:30)
Total points: 2
Description: Given in class by Dr. Bohlouli
Delivery: TBD

Homework 08: TBD (Language modeling)
Due date: Fri, 14/06, 23:59 (GMT+4:30)
Total points: 2
Description: Given in class by Dr. Bohlouli
Delivery: TBD