[WSS21] Assembling a global database of university course listings
Posted by Amir Sadeghi, 11 months ago
Assembling a global database of university course listings
by Amir Hosein Sadeghi Isfahani (He/him), University of Waterloo
The “global database of university course listings” shows interconnections between programs, courses, and course topics. The database benefits both students and university administrators. A global database gives students a big picture of their program and helps them stay motivated by showing the ties between courses; it also enables universities to optimize their programs and courses and to run student exchange programs with their partner universities effectively. Using the course listing at the University of Waterloo, we are currently working to understand the interconnections between the course topics in the physics courses. Our initial goal is an intelligent agent that tells us how much the topics in two physics courses overlap. This project is the first step towards our ambitious vision of fully online and accessible degree programs based on the elements of computational thinking.
Introduction
Background
Students use course listings (or catalogs) to choose their courses of interest as they pursue their university degrees. Academic and/or program advisors usually help students in course selection, enabling students to refine and realize an academic plan that is based on and aligned with their goals and aspirations. On the other hand, universities compose course listings to efficiently manage their resources, standardize student admissions, create and facilitate student exchange programs, and handle student transfers.
Course listings are usually published once or twice per year, showing how frequently different courses are offered in each term or semester. While a course listing usually varies from one university to another, a typical one provides information about various courses in each subject as well as course requisites, components, descriptions, and possible instructors. Additionally, subject, course and component names differ from one listing to another.
From an administrative standpoint, the main obstacle to comparing course listings is the discrepancy in names, not the difference in structure. This discrepancy limits the usefulness of course listings, and universities usually adopt procedures to work around it. In particular, universities generally go through three stages to approve the courses that students took at colleges or other universities: they first check course descriptions, then course syllabi, and finally intended learning outcomes (ILOs). This raises practical questions about the applicability of course listings in a global landscape: if a global database of university course listings is designed and implemented, can it resolve this issue? What are the other advantages of such a database? What is the roadmap toward such a database?
To start answering these questions, a global database of university course listings is created from the undergraduate (or graduate) course listings and other course-related data of participating universities. This database is trained to estimate how much two randomly chosen courses, from similar or different subjects at the same university or at two different universities, overlap in their topics, syllabi, and ILOs. It can therefore resolve the administrative issue by intelligently checking the course descriptions or any other course-related data that are fed into it. Being aware of the interconnections between courses and/or programs at such a high level of granularity, universities can offer customized or personalized programs without adding new programs and/or courses to their workflows. They can also optimize their programs and courses, reducing their costs. Additionally, this database benefits instructors and students by letting them know the overlaps and ties between courses. Using this database, instructors can, in their projects, tie their course materials to other courses that students take, motivating students to learn the material. Students can see a big picture of what they learn in their programs, discovering how different courses satisfy their interests and contribute to their skill sets.
Finally, this global database has some advantages for the paradigm of computational thinking. It allows us to design a program in which the material in each course interconnects with the material in other courses. Moreover, if the design of such a program is integrated with elements of computational literacy, the program empowers its students with inherently innovative computational thinking and prepares them for multi- and/or interdisciplinary projects and for working with rapidly advancing technologies. This universal database enables us to visually inspect the knowledge network of programs, courses, and course topics, and to extract some features of the space of university programs, courses, and course topics. These features are then used to design and implement a holistic approach to program and course design. While this holistic approach gives students the knowledge and freedom to reroute their academic paths, it also provides instructors with the knowledge and flexibility they need to design or revise their courses. This database can be used to design online courses and make their content interconnected. Moreover, this project itself can be expanded and used to define fully online programs based on computational knowledge, furthering fair and accessible higher education all around the world.
Implementation and Analysis
Outline
As mentioned, the structure and contents of course listings vary across universities. Additionally, universities compile and publish course listings on their webpages in different ways. Consequently, the process of web scraping, converting the mined data into a valid and structured dataset, and storing it in a standard and secure manner is itself a challenge.
In the rest of this section, two separate datasets are constructed from the undergraduate course listings at UWaterloo and MIT. While it is possible to create a dataset from the whole undergraduate listing at both of these universities (in fact, a code snippet is written that produces a dataset from UWaterloo’s undergraduate course listing), our effort is focused on the physics subject. In this way, analysing, testing, and troubleshooting the project is faster and more reliable.
We start by scraping the undergraduate course listing at UWaterloo. For this, we show a sample course profile from the UWaterloo undergraduate listing. This image reveals the structure and content of each course profile in the UWaterloo listing. Next, we explore the XML source of the UWaterloo course listing and extract some XML patterns with which we can build a dataset from the publicly available UWaterloo course listing. A similar procedure can be repeated for the MIT undergraduate course listing. However, it is more difficult to find XML patterns in the MIT XML source. For this reason, we focus on the MIT physics undergraduate course listing and create a dataset from its XML source.
Having clean datasets, we generate some graphs that show the relations between courses in four different subjects at UWaterloo. These graphs provide useful insights into the interconnections between different courses at the subject level.
In the rest of the notebook, we focus on finding the interconnections between different courses, using their course descriptions. For this, we work with the physics undergraduate course listings at UWaterloo and MIT. This has two advantages: we focus on a subject whose course topics and curricula we are familiar with, and we work with a prototype that makes analysing, testing, and troubleshooting the project faster and more reliable.
To achieve our goal, we explain, implement, and discuss different strategies in text analysis. Finally, we create a simple course recommendation system based on the “edit distance”. For a given physics course at MIT, this system suggests the three contextually nearest physics courses at UWaterloo.
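As a rough illustration of the edit-distance idea, the following minimal sketch ranks a toy catalog against a query description and returns the three closest entries. The catalog, its course codes, and the helper name `nearestCourses` are made up for illustration and are not part of the notebook; the notebook's actual recommendation system may differ.

```wolfram
(* Toy catalog with made-up course codes and descriptions; not real listing data. *)
toyCatalog = <|
   "A101" -> "Newtonian mechanics, energy, momentum",
   "A102" -> "Electricity, magnetism, circuits",
   "A201" -> "Quantum mechanics and wave functions",
   "A202" -> "Thermodynamics, entropy, heat engines"|>;

(* Hypothetical helper: return the codes of the n catalog entries whose
   descriptions have the smallest edit distance to the query description. *)
nearestCourses[query_String, catalog_Association, n_Integer : 3] :=
  Keys@TakeSmallestBy[catalog, EditDistance[query, #] &, n];

suggestions = nearestCourses["Classical mechanics, energy, momentum", toyCatalog]
```

Raw edit distance compares characters, not meaning, so it is sensitive to description length and wording; that is one reason to treat such a system as a simple baseline.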
Course listing at University of Waterloo
Design of undergraduate course listing at University of Waterloo
Some auxiliary functions for styling titles and figure captions:
In[]:= styleTitle[title_String] := Style[title, 14, FontFamily -> "Helvetica"];
styleFootnote[footnote_String] := Style[footnote, 10, Italic, FontFamily -> "Helvetica"];
The help webpage of UWaterloo listing
In[]:= SetOptions[EvaluationNotebook[], DockedCells -> None]
SetDirectory[NotebookDirectory[]];
listGuideLink = "https://uwaterloo.ca/registrar/registering-courses/understanding-course-description-listings";
A sample course description at UWaterloo (The first image in the list of imported images).
Out[]= [Figure: Sample course information at UWaterloo, retrieved on Tue 13 Jul 2021 from University of Waterloo's Registrar Office Webpage.]
Parsing course requisites: At UWaterloo, each course has zero or more pre-, co-, or anti-requisite(s). Moreover, extra information, such as the minimum grade for taking a course, is usually embedded in the string that contains the requisite information.
requisiteParser parses a requisite string and returns a list of pre-, co-, or anti-requisite(s).
In[]:= requisiteParser[req_String] := Module[
  {minGradDropped = StringReplace[
     StringDelete[req, "(" ~~ Shortest[___] ~~ "%" ~~ Shortest[___] ~~ ")"] (* Deleting the extra info about min grade *),
     {"or" | "/"} -> " "]} (* Replacing different course separators with " " *),
  Flatten[
   StringCases[minGradDropped,
    (letters : Repeated[LetterCharacter, {2, Infinity}] .. ~~ (" " | ",") ~~
       commaSeparatedNumbers : (((DigitCharacter .. ~~ LetterCharacter) | (DigitCharacter ..)) | " " | "," | ";") ..) /; UpperCaseQ[letters] :>
     Outer[StringJoin, {letters}, StringCases[commaSeparatedNumbers, DigitCharacter .. ~~ LetterCharacter | DigitCharacter ..]]
    (* Each group of a sequence of 2 or more uppercase letters followed by 1 or more sequences of digits is related to one subject *)]]];
Based on the structure of the UWaterloo course data in the above figure, parseCourse extracts all the different pieces of information from the XML block of a course.
In[]:= parseCourse[course_XMLElement] := Association[
  (* course code, subject code, category name, components, unit *)
  FirstCase[course[[3]],
   XMLElement["div", {"class" -> "divTableCell"},
     {XMLElement["strong", {},
       {XMLElement["a", {"shape" -> "rect", "name" -> courseShortCode_String}, {}], courseLongCode_String}]}] :>
    <|"Code" -> StringTrim[courseShortCode],
     "SubjectCode" -> StringTrim[Part[StringSplit[courseLongCode], 1]],
     "CatalogNumber" -> StringTrim[Part[StringSplit[courseLongCode], 2]],
     Association[# -> True & /@ StringSplit[Part[StringSplit[courseLongCode], 3], ","]],
     "Unit" -> Interpreter["Number"][Part[StringSplit[courseLongCode], 4]]|>],
  (* course ID *)
  "ID" -> FirstCase[course[[3]],
    XMLElement[___, {"class" -> "divTableCell crseid"}, {courseIDString_String}] :>
     First@StringCases[courseIDString, DigitCharacter ..]],
  (* course name *)
  "Name" -> FirstCase[course[[3]],
    XMLElement["div", {"class" -> "divTableCell colspan-2"}, {XMLElement["strong", {}, {courseName_String}]}] :>
     StringTrim[courseName]],
  (* course description *)
  "Description" -> FirstCase[course[[3]],
    XMLElement["div", {"class" -> "divTableCell colspan-2"}, {courseDesc_String}] :>
     StringTrim[StringSplit[courseDesc, "["][[1]]]]
   (* a course description at UWaterloo ends with this pattern: ... [Information on the terms in which a course offered]. *),
  (* course notes *)
  "Note" -> FirstCase[course[[3]],
    XMLElement["div", {"class" -> "divTableCell colspan-2"}, {note_String}] /; StringContainsQ[note, "Note:"] :>
     StringTrim[note], Missing["NotApplicable"], All],
  (* pre-requisite(s) *)
  "Prerequisite" -> FirstCase[course[[3]],
    element : XMLElement[___, ___, {s_String}] /; StringContainsQ[s, "Prereq"] :>
     requisiteParser[Part[StringSplit[s, ":"], 2]], Missing["NotApplicable"], All],
  (* co-requisite(s) *)
  "Corequisite" -> FirstCase[course[[3]],
    element : XMLElement[___, ___, {s_String}] /; StringContainsQ[s, "Coreq"] :>
     requisiteParser[Part[StringSplit[s, ":"], 2]], Missing["NotApplicable"], All],
  (* anti-requisite(s) *)
  "Antirequisite" -> FirstCase[course[[3]],
    element : XMLElement[___, ___, {s_String}] /; StringContainsQ[s, "Antireq"] :>
     requisiteParser[Part[StringSplit[s, ":"], 2]], Missing["NotApplicable"], All],
  (* offered online *)
  "OfferedOnline" -> FirstCase[course[[3]],
    element : XMLElement[___, ___, {s_String}] /; StringContainsQ[s, "offered Online"] -> True,
    Missing["NotApplicable"], All]];
This is the template for a subject course listing at UWaterloo:
https://ucalendar.uwaterloo.ca/*AcademicCalender:1920*/COURSE/course-*SubjectCode:PHYS*.html
where, for instance, 1920 means the academic year 2019-20.
subjectLinkUW generates the link to a subject listing for a given academic calendar.
In[]:= subjectLinkUW[academicCal_Integer, subjectCode_String] :=
  "https://ucalendar.uwaterloo.ca/" <> ToString[academicCal] <> "/COURSE/" <> "course-" <> subjectCode <> ".html";
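For instance, the link for the physics listing in the 2021-22 calendar can be built as follows. This sketch simply repeats the definition so that it runs on its own; the padding remark at the end is our own observation and assumes the same URL scheme for older calendars.

```wolfram
subjectLinkUW[academicCal_Integer, subjectCode_String] :=
  "https://ucalendar.uwaterloo.ca/" <> ToString[academicCal] <> "/COURSE/" <> "course-" <> subjectCode <> ".html";

physLink = subjectLinkUW[2122, "PHYS"]
(* "https://ucalendar.uwaterloo.ca/2122/COURSE/course-PHYS.html" *)

(* An integer cannot carry a leading zero, so a calendar such as 0910 would
   need padding; IntegerString pads on the left with zeros. *)
padded = IntegerString[910, 10, 4]
(* "0910" *)
```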
As mentioned, the XHTML versions of the course listings do not follow a standard protocol. Here, importXML works around a markup problem in the HTML source files of the UWaterloo course listings: it escapes tag-like runs of digits such as <123> before parsing.
In[]:= importXML[webpage_String] := ImportString[
   StringReplace[Import[webpage, "Text"],
    "<" ~~ n : DigitCharacter .. ~~ ">" :> ("&lt;" <> n <> "&gt;")],
   {"HTML", "XMLObject"}];
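The escaping step can be tried on its own. The sketch below applies the same replacement rule to a made-up HTML fragment; the helper name `escapeNumericTags` and the sample string are ours, not part of the notebook.

```wolfram
(* Hypothetical helper isolating the replacement rule used inside importXML:
   a "<", one or more digits, and a ">" become escaped entities, so the
   HTML parser no longer mistakes them for tags. *)
escapeNumericTags[html_String] :=
  StringReplace[html, "<" ~~ n : DigitCharacter .. ~~ ">" :> ("&lt;" <> n <> "&gt;")];

escaped = escapeNumericTags["<p>See note <1> and <23></p>"]
(* "<p>See note &lt;1&gt; and &lt;23&gt;</p>" *)
```

Real tags such as `<p>` are left untouched because they do not consist of digits only.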
coursesInSubject extracts all the courses in a subject listing.
In[]:= coursesInSubject[subjectXML_] :=
  Cases[subjectXML, XMLElement["div", {___, "class" -> "divTable", ___}, {___}], All]
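To see what this pattern picks up, here is a minimal sketch on a hand-written symbolic XML tree. The tree and its contents are made up; the real pages are far larger.

```wolfram
(* Toy symbolic XML with made-up content. *)
toyPage = XMLElement["body", {}, {
    XMLElement["div", {"class" -> "divTable"}, {"first course block"}],
    XMLElement["p", {}, {"unrelated paragraph"}],
    XMLElement["div", {"id" -> "c2", "class" -> "divTable"}, {"second course block"}]}];

(* Same pattern shape as coursesInSubject: any div whose attribute list
   contains "class" -> "divTable", searched at every level of the tree. *)
courseBlocks = Cases[toyPage,
   XMLElement["div", {___, "class" -> "divTable", ___}, {___}], All];

Length[courseBlocks]
(* 2 *)
```

The `___` on both sides of the class rule lets the pattern match regardless of which other attributes the div carries.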
subjecListingUW downloads the course listing of a subject for a given academic year, exports it as a dataset, and also returns it as a dataset.
In[]:= subjecListingUW[academicCal_Integer, subjectCode_String] := Module[
   {outputName = StringTemplate["UWaterloo_undergrad_`code`_list_`year`.wxf"][<|"code" -> subjectCode, "year" -> academicCal|>],
    subjectListing},
   subjectListing = Dataset@KeyUnion@(parseCourse /@ coursesInSubject@importXML@subjectLinkUW[academicCal, subjectCode]);
   Export[SetDirectory[NotebookDirectory[]]; outputName, subjectListing];
   subjectListing];
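Two small pieces of this function can be checked in isolation: KeyUnion, which gives every course record the same set of keys before the records become a Dataset, and the StringTemplate that builds the output file name. The records below are made up; we assume the keys differ between courses because each course lists its own components (for example lecture or lab).

```wolfram
(* Made-up records: the second course has no "LAB" component key. *)
rows = KeyUnion[{
    <|"Code" -> "X101", "LEC" -> True, "LAB" -> True|>,
    <|"Code" -> "X102", "LEC" -> True|>}];

(* After KeyUnion the absent key is filled with a Missing value. *)
rows[[2]]["LAB"]
(* Missing["KeyAbsent", "LAB"] *)

outputName = StringTemplate["UWaterloo_undergrad_`code`_list_`year`.wxf"][
  <|"code" -> "PHYS", "year" -> 2122|>]
(* "UWaterloo_undergrad_PHYS_list_2122.wxf" *)
```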
The list of the different course components at UWaterloo. The component list is used as a guide for the UWaterloo course listing.
The Physics, Applied Mathematics, Mathematics, and Computer Science course listings at UWaterloo for the academic year 2021-22 are created and exported.
Out[]= UWaterloo physics undergraduate course listing (Head: Dataset, Byte count: 298984)
The dataset of UWaterloo undergraduate course listing
uwUndergradListing downloads the UWaterloo undergraduate listing for a given academic year, exports it as a dataset, and also returns it as a dataset.
In[]:= uwUndergradListing[academicCal_Integer] := Module[
   {listRepo = Import["https://ugradcalendar.uwaterloo.ca/page/Course-Descriptions-Index", {"HTML", "Data"}],
    faculties, subjects, subjectsWithinFaculties, uwListing},
   faculties = Dataset@Cases[listRepo,
      {facultyCode_String, facultyName_String} /; StringLength[facultyCode] == 3 :>
       <|"FacultyCode" -> facultyCode, "FacultyName" -> facultyName|>, All];
   subjects = Dataset@Cases[listRepo,
      {subject_String, code_String, facultyCode_String, ___} /; StringLength[facultyCode] == 3 :>
       <|"SubjectName" -> subject, "SubjectCode" -> code, "FacultyCode" -> facultyCode|>, All];
   subjectsWithinFaculties = JoinAcross[faculties, subjects, "FacultyCode"];
   uwListing = JoinAcross[subjectsWithinFaculties,
     KeyUnion@(parseCourse /@ coursesInSubject@subjectsWithinFaculties[All, importXML@subjectLinkUW[academicCal, #SubjectCode] &]),
     "SubjectCode"];
   (* Set the working directory first, then export the file, as in subjecListingUW *)
   Export[SetDirectory[NotebookDirectory[]];
    StringTemplate["UWaterloo_undergrad_list_`year`.wxf"][<|"year" -> academicCal|>], uwListing];
   uwListing]
The UWaterloo undergraduate listing in the academic year 2021-22
In[]:= uwUndergrad2021 = uwUndergradListing[2122];
Course listing at MIT
The HTML sources of the course listing webpages at MIT are less structured than those at UWaterloo, so it is more difficult to extract information in a systematic way. Moreover, it is not possible to combine the MIT pipeline with the UWaterloo one. Consequently, the MIT course listing has information only about course codes, names, and topics. It is also unclear how to access the archive of the course listings for previous years.
MIT physics undergraduate course listing
In comparison to UWaterloo, functions here are shorter and smaller in number. The focus is on the Physics course listing with subject code 8.
Based on the structure of an MIT course description, parseCourseMIT extracts all the different pieces of information from the XML block of a course.
In[]:= parseCourseMIT[course_] := Association[
  (* course code *)
  "Code" -> FirstCase[course,
    XMLElement["a", {"shape" -> "rect", "name" -> code_String}, {}] :> StringTrim[code],
    Missing["NotApplicable"], All],
  (* course name *)
  "Name" -> FirstCase[course,
    XMLElement["h3", {}, {name_String, XMLElement["br", {"clear" -> "none"}, {}], ___, ___}] :>
     StringTrim[StringSplit[name, DigitCharacter ~~ "." ~~ Shortest[___] ~~ (" ") ..][[1]]],
    Missing["NotApplicable"], All],
  (* course description *)
  "Description" -> FirstCase[course,
    pat : {___, XMLElement["img", {"alt" -> ___, "src" -> ___}, {}], ___,
       XMLElement["br", {"clear" -> "none"}, {}], desc___,
       XMLElement["br", {"clear" -> "none"}, {}], ___} :> StringTrim[desc],
    "no", All]];
This is the template for a subject course listing at MIT:
http://student.mit.edu/catalog/m*SubjectCode*a.html
where, for instance, SubjectCode=8 is for the physics course listing in Fall 2021.
subjectLinkMIT generates a valid link to a subject course listing.
In[]:= subjectLinkMIT[subjectCode_Integer] :=
  "http://student.mit.edu/catalog/m" <> ToString[subjectCode] <> "a.html"
As mentioned, the XHTML versions of the course listings do not follow a standard protocol. Here, coursesInSubjectMIT extracts a list of course descriptions from a subject course listing.
In[]:= coursesInSubjectMIT[webpage_String] :=
  ImportString[#, {"HTML", "XMLObject"}] & /@
   StringCases[Import[webpage, {"HTML", "Source"}], "<a name=" ~~ Shortest[___] ~~ "<!--end-->"];
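The role of Shortest in this split can be seen on a made-up source string with two course blocks. The sample markup below only imitates the `<a name=...> ... <!--end-->` shape that the pattern relies on; it is not copied from the MIT pages.

```wolfram
(* Made-up page source with two course blocks. *)
source = "<a name=\"X.01\"></a>First<!--end--><a name=\"X.02\"></a>Second<!--end-->";

(* With Shortest, each match stops at the first "<!--end-->" it meets. *)
blocks = StringCases[source, "<a name=" ~~ Shortest[___] ~~ "<!--end-->"];
Length[blocks]
(* 2 *)

(* Without Shortest, the greedy match runs to the last "<!--end-->". *)
greedy = StringCases[source, "<a name=" ~~ ___ ~~ "<!--end-->"];
Length[greedy]
(* 1 *)
```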
subjecListingMIT downloads the course listing of a subject for the current term, exports it as a dataset, and also returns it as a dataset. The academicTerm argument, for instance Fall2021, is supplied by the user by hand.
In[]:= subjecListingMIT[subjectCode_Integer, subjectName_String, academicTerm_String] := Module[
   {subjectListing},
   subjectListing = Dataset@KeyUnion@(parseCourseMIT /@ coursesInSubjectMIT@subjectLinkMIT[subjectCode]);
   Export[SetDirectory[NotebookDirectory[]];
    StringTemplate["MIT_undergrad_`subject`_list_`term`.wxf"][<|"term" -> academicTerm, "subject" -> subjectName|>],
    subjectListing];
   subjectListing];
The physics course listing at MIT for the academic term Fall 2021 is created and exported.
Out[]= MIT physics undergraduate course listing (Head: Dataset, Byte count: 57880)
Exploratory Data Visualization
The subject-level or whole course listing is visualized to give some idea of the listings before the text analysis.
Visualizing the relation between courses in UWaterloo
It is impossible to visually explore all the course listings with the same visualization tools, because the course descriptions differ between universities. Here, graphReq plots a common-neighbors network with some chosen options, using the modularity measure. Caution: the readability of the visualized graphs is highly sensitive to the input graph.
In[]:= graphReq[graph_List, label_String] := Framed@Labeled[
    CommunityGraphPlot[graph, FindGraphCommunities[graph, Method -> "Modularity"],
     DirectedEdges -> True, ImageSize -> 300, AspectRatio -> 1],
    Style[label, 10], Top];
graphAntiReq plots a common-neighbors network with some chosen options, using the modularity measure.
In[]:= graphAntiReq[graph_List, label_String] := Framed@Labeled[
    CommunityGraphPlot[SimpleGraph@UndirectedGraph@graph,
     FindGraphCommunities[graph, Method -> "Modularity"],
     DirectedEdges -> False, ImageSize -> 300, AspectRatio -> 1,
     EdgeShapeFunction -> ({Arrowheads[{-0.02, 0.02}], Arrow[#1]} &)],
    Style[label, 10], Top]
The networks of pre-, co-, and anti-requisite courses are plotted for a given subject course list.
In[]:= requisteGraphUW[dataset_Dataset, subjectName_String, academicCal_String] := Framed[
   Labeled[
    Multicolumn[{
      (* Prerequisite graph *)
      graphReq[
       Flatten[Normal@DeleteMissing[dataset[All, Thread[#Prerequisite -> #Code] &], 1, Infinity], 1],
       "Prerequisites (Arrows mean for)"],
      (* Corequisite graph *)
      graphReq[
       DeleteCases[
        Normal@Flatten[DeleteMissing[dataset[All, Thread[#Corequisite <-> #Code] &], 1, Infinity], 1],
        (one_String <-> two_String) /; StringPart[two, -1] == "L", 2],
       "Corequisites"],
      (* Antirequisite graph *)
      graphAntiReq[
       Flatten[Normal@DeleteMissing[dataset[All, Thread[#Code -> #Antirequisite] &], 1, Infinity], 1],
       "Antirequisites (Double-arrow means mutually exclusive)"]}, 3],
    styleTitle[
     StringTemplate["Network of requisites in `subject` (Academic calendar `year`), using the modularity measure"][
      <|"subject" -> subjectName, "year" -> academicCal|>]],
    Top]];
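The edge lists fed to these plots come from threading each course's requisite list over its own code. A minimal sketch on made-up records (the course codes are placeholders, not real UWaterloo courses) shows the step in isolation:

```wolfram
(* Made-up course records. *)
toyCourses = {
   <|"Code" -> "X201", "Prerequisite" -> {"X101", "X102"}|>,
   <|"Code" -> "X301", "Prerequisite" -> {"X201"}|>};

(* Thread turns {p1, p2} -> code into {p1 -> code, p2 -> code}. *)
edges = Flatten[Thread[#Prerequisite -> #Code] & /@ toyCourses]
(* {"X101" -> "X201", "X102" -> "X201", "X201" -> "X301"} *)

(* A list of rules is read as a directed graph. *)
prereqGraph = Graph[edges, VertexLabels -> "Name"];
```

Each arrow points from a prerequisite to the course that requires it, which matches the direction used for the prerequisite plot above.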
Let’s look at the requisites for the different courses in four subjects at UWaterloo:
In[]:= Multicolumn[{
   requisteGraphUW[physUW2122, "Physics", "2021-22"],
   requisteGraphUW[amathUW2122, "Applied Mathematics", "2021-22"],
   requisteGraphUW[mathUW2122, "Mathematics", "2021-22"],
   requisteGraphUW[csUW2122, "Computer Science", "2021-22"]}, 1]
The above figures show the complexity of the course interconnections at the subject level. For the courses in each subject, the interconnections with courses from other subjects are also considered in generating each sub-plot.
Text Analysis
There are different approaches to measuring the similarity between two courses, whether they belong to one subject, to two different subjects at one university, or to different universities. Regardless of the chosen approach, the course topics, syllabus, or ILOs must first be tokenized so that they can be used in machine learning (ML).
Making course information machine-understandable
Below, different tokenizing functions are defined and the reason for defining each of them is given.
Course descriptions at UWaterloo and MIT
Here, subsets of the UWaterloo and MIT physics course listings are selected and converted to associations. For the UWaterloo physics listing, lab courses that are co-requisites of other courses are dropped, while labs that are independent courses are kept. These co-requisite lab courses usually do not have standalone course descriptions. At MIT, lab courses are independent courses that require their theoretical counterparts as prerequisites, and they have their own course descriptions. Therefore, lab courses are not dropped from the MIT physics listing.
Association of pairs of course code and description.
In[]:= physUW2122Desc=Association@Normal@physUW2122[Select[!((#LAB==True)&&(StringTake[#Code,{1,-2}]==#Corequisite[[1]]))&],#Code->#Description&];
physMITFall2021Desc=Association@Normal@physMITFall2021[All,#Code->#Description&];
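To make the lab-filtering rule concrete, here is a minimal sketch on a toy dataset. The rows and course codes are hypothetical and only mimic the field names used above ("Code", "LAB", "Corequisite", "Description"); the helper name isCoreqLab is also an assumption introduced for illustration.

```wolfram
(* Toy rows with hypothetical course codes; field names follow the code above *)
toy = Dataset[{
    <|"Code" -> "PHYS 121L", "LAB" -> True, "Corequisite" -> {"PHYS 121"},
      "Description" -> "(no standalone description)"|>,
    <|"Code" -> "PHYS 260L", "LAB" -> True, "Corequisite" -> {"PHYS 234"},
      "Description" -> "Standalone laboratory."|>,
    <|"Code" -> "PHYS 121", "LAB" -> False, "Corequisite" -> {"PHYS 121L"},
      "Description" -> "Mechanics."|>}];

(* A lab counts as a co-requisite lab when its code, minus the trailing
   letter, equals its first listed corequisite *)
isCoreqLab = (#LAB == True) && (StringTake[#Code, {1, -2}] == #Corequisite[[1]]) &;

(* Keep every row that is not a co-requisite lab, then map code -> description *)
Association@Normal@toy[Select[! isCoreqLab[#] &], #Code -> #Description &]
```

Under these assumptions, only the "PHYS 121L" row should be dropped: "PHYS 260L" is a lab, but its code without the trailing letter does not match its corequisite, so it is treated as an independent course.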
Association of pairs of course code and name.
In[]:= physUW2122Name=Association@Normal@physUW2122[Select[!((#LAB==True)&&(StringTake[#Code,{1,-2}]==#Corequisite[[1]]))&],#Code->#Name&];
physMITFall2021Name=Association@Normal@physMITFall2021[All,#Code->#Name&];
Tokenize functions
In natural language processing (NLP), tokenization is the process of splitting a sequence of characters into smaller units called tokens, based on a predefined document unit. Tokens are sometimes loosely called words or terms. A token can be a character, a sequence of characters, a word, a phrase, or a sentence, depending on the context of the document(s) and the measures used to define the document unit.
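As a small illustration of how the choice of document unit changes the tokens, the sketch below splits the same string at three granularities using built-in functions. The sample text is made up for the example and is not taken from the course listings.

```wolfram
(* Illustrative sample text; not taken from the course listings *)
sample = "Wave motion. Geometrical optics; reflection and refraction.";

Characters[sample]      (* document unit: single characters *)
TextWords[sample]       (* document unit: words *)
TextSentences[sample]   (* document unit: sentences *)
```

Each call returns a list of tokens; the finer the document unit, the longer the list and the less context each token carries.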
In what follows, several tokenization functions are introduced, each built from a series of rules that define a document unit. For instance, the first function, tokenizeCommon (defined in the code below under the name tokenize), applies five rules: converting to lowercase, removing diacritics, deleting stop words, splitting into words, and trimming whitespace from the beginning and end of each word. This is a common tokenization recipe.
In[]:= tokenize[text_]:=StringTrim@TextWords@DeleteStopwords@RemoveDiacritics@ToLowerCase@text
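A brief usage sketch may help here. The definition is repeated so that the example runs on its own, and the input string is an illustrative stand-in for a course description, not an entry from the dataset.

```wolfram
(* Same definition as above, repeated so the example is self-contained *)
tokenize[text_] := StringTrim@TextWords@DeleteStopwords@RemoveDiacritics@ToLowerCase@text

(* Illustrative input; the result is a list of lowercase words without stop words *)
tokenize["Oscillations; simple harmonic motion. Reflection and refraction."]
```

Because the rules are composed right to left with @, the text is lowercased first and split into words last, so stop-word removal operates on the whole string rather than on individual tokens.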
Course topics are usually noun phrases (more exactly, one or more noun phrases connected by “and”) and are separated by “,”, “;”, “.”, and “and”. This statement is based on visual inspection of the course listings at UWaterloo and MIT. To better understand these two features, let’s look at two course descriptions, one from each university:
Out[]= Physics for Engineers at UWaterloo
(word-level tags in parentheses; phrase-level tags in brackets, in the order they appear in the output)
Oscillations(Noun) [NounPhrase] ;(Punctuation) simple(Adjective) harmonic(Adjective) [AdjectivePhrase] motion(Noun) [NounPhrase] .(Punctuation) [NounPhrase]
Wave(Adjective) motion(Noun) ,(Punctuation) travelling(Noun) and(Conjunction) standing(Noun) waves(Noun) [NounPhrase] ;(Punctuation) transverse(Adjective) and(Conjunction) longitudinal(Adjective) [AdjectivePhrase] waves(Noun) [NounPhrase] ,(Punctuation) including(Verb) sound(Noun) [NounPhrase] [PrepositionalPhrase] [NounPhrase] .(Punctuation) [NounPhrase]
Geometrical(Adjective) optics(Noun) [NounPhrase] ;(Punctuation) reflection(Noun) and(Conjunction) refraction(Noun) [NounPhrase] .(Punctuation) [NounPhrase]
Physical(Adjective) optics(Noun) [NounPhrase] ;(Punctuation) interference(Noun) and(Conjunction) diffraction(Noun) [NounPhrase] .(Punctuation) [NounPhrase]
Quantum(ProperNoun) physics(Noun) [NounPhrase] ;(Punctuation) quantization(Noun) [NounPhrase] of(Preposition) radiation(Noun) [NounPhrase] [PrepositionalPhrase] [NounPhrase] ;(Punctuation) hydrogen(Noun) [NounPhrase] atom(Noun) [NounPhrase] [NounPhrase] .(Punctuation) [NounPhrase]
These features of course topics allow more precise definitions of tokenize functions, as done below.
tokenizeNounPhrase uses TextCases with the text content type “NounPhrase” and tokenizes a course topic into a list of noun phrases. Since TextCases[text,"NounPhrase"] returns noun phrases recursively, i.e. it returns each noun phrase together with the noun phrases nested inside it until none remain, DeleteDuplicates is applied to drop the repeated noun phrases at the course level.
In[]:= tokenizeNounPhrase[text_]:=StringTrim@DeleteDuplicates@TextCases[RemoveDiacritics@ToLowerCase@text,"NounPhrase"]
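tokenizeStringSplit splits a course topic into a list of topics at punctuation characters such as “,”, “;”, and “.” (the code uses PunctuationCharacter as the delimiter pattern), then drops empty strings and duplicates: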
In[]:= tokenizeStringSplit[text_]:=StringTrim@DeleteDuplicates@Select[StringSplit[RemoveDiacritics@ToLowerCase@text,PunctuationCharacter],StringLength[#]!=0&]
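One detail of the definition above is that whitespace is trimmed only after empty strings and duplicates have been removed, so a piece consisting solely of spaces (for example, between two adjacent punctuation marks) survives the filter and becomes an empty token. The sketch below is a hypothetical variant, not the author's definition, that trims first; the name tokenizeStringSplitTrimmed and the sample input are assumptions made for illustration.

```wolfram
(* Hypothetical variant, not the author's definition:
   trim each piece first, then remove empty strings and duplicates *)
tokenizeStringSplitTrimmed[text_] := DeleteDuplicates@DeleteCases[
   StringTrim@StringSplit[RemoveDiacritics@ToLowerCase@text, PunctuationCharacter],
   ""]

(* Illustrative input with adjacent punctuation marks separated by a space *)
tokenizeStringSplitTrimmed["Geometrical optics; reflection and refraction. , Physical optics"]
```

With this ordering, the whitespace-only piece between “.” and “,” is reduced to an empty string and removed, and topics that differ only in surrounding spaces are treated as duplicates.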
Two other strategies can also be considered. In the first strategy, the “Noun” text content type is used as the form in TextCases. tokenizeNoun returns a list of nouns as tokens. Like tokenizeNounPhrase, it drops the duplicates. Unlike tokenizeNounPhrase and tokenizeStringSplit, it drops stop words.
In[]:= tokenizeNoun[text_]:=StringTrim@DeleteDuplicates@TextCases[RemoveDiacritics@ToLowerCase@text,"Noun"]
Before introducing the second strategy, let’s compare the outcomes of the tokenize functions:
Out[]=