A tool that helps students and teachers identify overlapping courses in the course directory, to avoid course redundancy within an institution.
A tool that helps students and teachers identify overlapping courses in the course directory, so that redundant courses are not offered. Users search their institution's course directory simply by entering the name of a course. The full name is not required: short forms such as "intro" (for introduction) and "b/w" (for between) are accepted. Users can also find courses similar to a whole set of courses.
The core function of the system is finding the courses in the course directory that are similar to a selected course. The user searches by course name, and the system suggests the courses most likely to be similar to the one entered.
Extending this core functionality to multiple courses, we implemented similarity matching on a set of courses as well. For this, we compute the similarity metric of every course in the dataset with respect to each of the input courses. Then, for each course in the dataset, we sum the similarity metrics obtained with respect to each input. The courses with the highest resulting sums are the ones we suggest.
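The summing step can be sketched as follows. This is a minimal illustration, not the project's actual code: it assumes each course is represented by a numeric vector and uses cosine similarity as the metric, and the names `cosine`, `recommend_for_set` and the toy vectors are hypothetical. Excluding the input courses from the results is also an assumption.

```python
from math import sqrt

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def recommend_for_set(embeddings, query_names, top_k=10):
    """Rank catalogue courses by their summed similarity to the query courses."""
    totals = {}
    for name, vec in embeddings.items():
        if name in query_names:
            continue  # assumption: do not recommend the inputs themselves
        # Sum the similarity of this course to every input course.
        totals[name] = sum(cosine(vec, embeddings[q]) for q in query_names)
    return sorted(totals, key=totals.get, reverse=True)[:top_k]

# Toy 2-D vectors standing in for real course representations (made-up data).
embeddings = {
    "Intro to Programming": [1.0, 0.0],
    "Data Structures": [0.9, 0.1],
    "Linear Algebra": [0.0, 1.0],
    "Numerical Methods": [0.5, 0.5],
    "Art History": [-1.0, 0.0],
}
ranked = recommend_for_set(
    embeddings, ["Intro to Programming", "Linear Algebra"], top_k=3
)
print(ranked)
```

In this toy data, "Numerical Methods" is moderately similar to both inputs, so its sum beats "Data Structures", which is close to only one of them. That is the behaviour a plain sum rewards.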
We have added spelling correction for user input. This spell correction, however, is not generic: it is specific to our dataset. This ensures that the input is not corrected to a generic English sentence that cannot be found in our dataset.
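One way to get dataset-specific correction is to match each query word only against words that occur in the course names. The sketch below uses `difflib.get_close_matches` from the standard library; the helper names, the sample catalogue and the 0.8 cutoff are assumptions chosen for illustration and may differ from what the project does.

```python
import difflib

def build_vocabulary(course_names):
    """Collect every lowercase word that appears in the catalogue's course names."""
    return sorted({word for name in course_names for word in name.lower().split()})

def correct_query(query, vocabulary, cutoff=0.8):
    """Replace each word with its closest vocabulary word, if one is close enough."""
    corrected = []
    for word in query.lower().split():
        match = difflib.get_close_matches(word, vocabulary, n=1, cutoff=cutoff)
        # Keep the word unchanged when nothing in the dataset is similar enough.
        corrected.append(match[0] if match else word)
    return " ".join(corrected)

# Made-up catalogue used only to build the vocabulary.
catalogue = ["Introduction to Machine Learning", "Linear Algebra", "Operating Systems"]
vocab = build_vocabulary(catalogue)
fixed = correct_query("machne lerning", vocab)
print(fixed)
```

Because the vocabulary comes from the catalogue itself, a correction can only ever produce words that exist in the dataset.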
We have implemented a preprocessor that takes a file pointer object as input and extracts the data from the specified file. For this, we use Python's openpyxl to read Excel files cell by cell and identify the cells that may contain useful information. We then preprocess this information and pack it into a Course object, which can be used for further processing.
To avoid unnecessary computation and make the system faster, we have implemented event-specific updating. The system accepts each newly added course immediately, but it does not compute the similarity metrics until a function that uses them is called.
*This is a high-fidelity prototype of the interface; an actual working interface is in the works.
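The idea of deferring the computation can be sketched with a cached result and a "stale" flag. This is an illustrative pattern only: the class `LazySimilarityStore`, its methods and the toy `word_overlap` metric are hypothetical and are not taken from the project's code.

```python
class LazySimilarityStore:
    """Defers the expensive similarity computation until a result is requested."""

    def __init__(self, similarity_fn):
        self._similarity_fn = similarity_fn  # hypothetical pairwise scoring function
        self._courses = []
        self._matrix = None   # cached pairwise similarities
        self._stale = False   # True when courses were added since the last computation
        self.recompute_count = 0

    def add_course(self, course):
        """Cheap: store the course and mark the cache as out of date."""
        self._courses.append(course)
        self._stale = True

    def similarities(self):
        """Expensive work happens here, and only if something changed."""
        if self._stale or self._matrix is None:
            self._matrix = [[self._similarity_fn(a, b) for b in self._courses]
                            for a in self._courses]
            self._stale = False
            self.recompute_count += 1
        return self._matrix

def word_overlap(a, b):
    """Toy Jaccard overlap of the words in two course names."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb)

store = LazySimilarityStore(word_overlap)
store.add_course("Intro to Statistics")
store.add_course("Statistics for Engineers")
```

Adding many courses in a row then costs only list appends; the pairwise matrix is rebuilt once, on the next call that needs it.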
- pandas
- NumPy
- TensorFlow
- TensorFlow Hub
- scikit-learn
- difflib
- pickle
- openpyxl
Ensure that you have at least 50 MB of free space on your system.
To start, you need to initialize a Course_Loader object. This is where all the data is stored and processed, and where the similarity is calculated. It can be initialized in the following way:
c = Course_Loader()
Once the object is initialized, you have to add all the .xlsx files containing the information about the courses. To add a new course, call the addCourse function of your Course_Loader object. For this you need to retrieve the directory of the .xlsx files containing your data. In our case, the files are located in the Data folder in the same directory as our Python file.
import glob
import os

data_dir = os.path.join(os.getcwd(), "Data")  # Data folder under the current working directory
xlsx_files = glob.glob(os.path.join(data_dir, "*.xlsx"))  # Paths of all .xlsx course files
for f in xlsx_files:
    c.addCourse(f)  # Add each course using the addCourse function
Now that all the courses have been added, you can use the system. Note that you may add a new course at any time, and the system will update itself to include it in the results. To make a query, call the combine_recommendations function of your Course_Loader object. The input to this function is a list of strings, where each string denotes a course that you wish to search for (see the examples for more clarity).
It would look something like this:
query = ["Course-1","Course-2","Course-3", ...] #The courses which you wish to search
c.combine_recommendations(query)
Running this returns the output as a pandas DataFrame. The output contains the top 10 courses similar to the courses you entered, in ranked order, with the first being the most similar.
None so far. More extensive testing needed on a larger dataset.