Methodology
Computer Information
Experiments were performed on Ubuntu 20.04 running under Windows Subsystem for Linux, on a machine with an 8-core, 16-thread AMD Ryzen 3700X CPU and an NVIDIA RTX 2070 SUPER GPU, with virtual memory usage capped at 32 GB. This memory limit was reached quickly when analyzing full article content, so future experiments that extend the methods in this paper should be run on a virtual machine or on upgraded hardware.
Software
Python 3.7
Linear Algebra and Data: NumPy, pandas
Neural Network Architecture: TensorFlow, Keras backend
Statistics, Classification, and Evaluation: scikit-learn
Other Packages: re, time
Preprocessing
For experiments where the input length was constrained, all titles of at least 30 and no more than 96 characters in length were retained, and all others were discarded. In every experiment, the data were balanced by finding the greatest multiple of 100 less than or equal to the size of the smaller class and capping the number of samples per class at that number. Quotation marks were then normalized to standard nondirectional single and double quotes, and every non-whitespace character was assigned an ID, with unknown characters assigned UNK. Since the length of this ID table depends on the character set used, uppercase and lowercase characters are not assigned uniform IDs. The resulting strings of IDs are padded with 0s to a uniform length so that the model can take them as input.
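
The preprocessing steps above can be sketched in plain Python. This is a minimal illustration, not the code used in the experiments: the function names, the choice of ID 1 for UNK, the mapping of whitespace to 0, and the decision to keep the first samples of each class when capping are all assumptions made for this example.

```python
from collections import defaultdict

PAD_ID = 0  # padding value, as described in the text
UNK_ID = 1  # assumed ID for unknown characters (the text only names it UNK)

# Directional quotation marks mapped to their nondirectional equivalents.
QUOTE_MAP = str.maketrans({"\u2018": "'", "\u2019": "'", "\u201c": '"', "\u201d": '"'})


def clean_quotes(text):
    """Replace curly single and double quotes with straight ones."""
    return text.translate(QUOTE_MAP)


def filter_by_length(pairs, lo=30, hi=96):
    """Keep (title, label) pairs whose title has between lo and hi characters."""
    return [(t, y) for t, y in pairs if lo <= len(t) <= hi]


def balance(pairs, step=100):
    """Cap each class at the greatest multiple of `step` that fits the smallest class."""
    by_class = defaultdict(list)
    for text, label in pairs:
        by_class[label].append(text)
    cap = min(len(texts) for texts in by_class.values()) // step * step
    # Assumption: the first `cap` samples of each class are kept.
    return [(t, label) for label, texts in by_class.items() for t in texts[:cap]]


def build_vocab(texts):
    """Assign an ID to every non-whitespace character, starting after UNK_ID."""
    chars = sorted({c for t in texts for c in clean_quotes(t) if not c.isspace()})
    return {c: i for i, c in enumerate(chars, start=UNK_ID + 1)}


def encode(text, vocab, max_len):
    """Convert a title to a fixed-length list of character IDs, padded with 0."""
    ids = []
    for c in clean_quotes(text)[:max_len]:
        if c.isspace():
            ids.append(PAD_ID)  # assumption: whitespace shares the padding ID
        else:
            ids.append(vocab.get(c, UNK_ID))
    return ids + [PAD_ID] * (max_len - len(ids))
```

Because `build_vocab` derives the ID table from whichever texts it is given, changing the character set changes the IDs, which is consistent with the remark that IDs are not uniform across character sets.
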
Classification
Experimentation initially used a classifier with internals nearly identical to those of Zhang, Zhao, and LeCun (2016). From there, we varied the input length, the character set, and the number of nodes in the fully-connected layers. The activation layers use the rectified linear unit, \(\mathrm{ReLU}(x) = \max\{0, x\}\), and the batch size for training was set to 64 samples.
Layer order | Conv1D filters | Kernel sizes | Max-pooling factor |
---|---|---|---|
C-M-C-M-C-C-C-C-M-D-D | 256 | 6-6-2-2-2-2 | 3 |

Samples | Case | Input length | FC size | Parameters |
---|---|---|---|---|
42800 | L | 339 | 1024 | 4786143 |
42800 | U | 339 | 1024 | 4837025 |
42800 | L | 339 | 256 | 2031327 |
42800 | U | 339 | 256 | 2082209 |
29800 | L | 123 | 256 | 1507039 |
29800 | U | 123 | 256 | 1557921 |
29800 | L | 123 | 64 | 1395871 |
29800 | U | 123 | 64 | 1446753 |
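
As a rough illustration of how the layer order in the table could be turned into a Keras model, the sketch below reads C as a 1D convolution, M as max pooling, and D as a fully-connected layer. Several details are assumptions not stated in the text: the one-hot input shape, the default "valid" padding, and the single-unit sigmoid output appended after the last D layer. The helper names `layer_plan` and `build_model` are hypothetical, and the sketch is not expected to reproduce the parameter counts listed above.

```python
def layer_plan(order="C-M-C-M-C-C-C-C-M-D-D", kernels="6-6-2-2-2-2"):
    """Pair each layer code with its kernel size (None for non-convolutional layers)."""
    sizes = iter(int(k) for k in kernels.split("-"))
    return [(code, next(sizes) if code == "C" else None) for code in order.split("-")]


def build_model(input_len, vocab_size, fc_size, filters=256, pool=3):
    """Assemble a Sequential model that follows the layer order from the table."""
    # Imported here so that layer_plan can be used without TensorFlow installed.
    from tensorflow import keras

    model = keras.Sequential()
    # Assumption: each character is presented as a one-hot vector.
    model.add(keras.Input(shape=(input_len, vocab_size)))
    flattened = False
    for code, kernel in layer_plan():
        if code == "C":
            model.add(keras.layers.Conv1D(filters, kernel, activation="relu"))
        elif code == "M":
            model.add(keras.layers.MaxPooling1D(pool_size=pool))
        else:  # "D": flatten once, then add a fully-connected ReLU layer
            if not flattened:
                model.add(keras.layers.Flatten())
                flattened = True
            model.add(keras.layers.Dense(fc_size, activation="relu"))
    # Assumption: a binary output head, since the data contain two classes.
    model.add(keras.layers.Dense(1, activation="sigmoid"))
    return model
```

A call such as `build_model(339, vocab_size=70, fc_size=1024)` would correspond in spirit to the first row of the table, where the vocabulary size of 70 is a placeholder; training would then pass `batch_size=64` to `model.fit`.
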