Sentence Reconstruction
The purpose of this project is to take as input a sequence of words corresponding to a random permutation of a given English sentence, and to reconstruct the original sentence.
The output can be produced either in a single shot or through an iterative (autoregressive) loop that generates one token at a time.
CONSTRAINTS:
- No pretrained model can be used.
- The neural network models should have fewer than 20M parameters (a quick way to check this is sketched below).
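The following is a minimal sketch of how the parameter budget could be checked with Keras; the tiny network below is only a stand-in, to be replaced by whatever model you actually build.

from keras.layers import Input, Embedding, Dense
from keras.models import Model

# A tiny stand-in network, only to illustrate the check; replace it with your own model
inp = Input(shape=(32,))
x = Embedding(10000, 64)(inp)
out = Dense(10000, activation='softmax')(x)
model = Model(inp, out)

n_params = model.count_params()
print("total parameters:", n_params)
assert n_params < 20_000_000, "model exceeds the 20M parameter budget"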
Dataset
The dataset is built from a snapshot of Wikipedia. We restricted the vocabulary to the 10K most frequent words and only kept sentences that use this vocabulary. In addition, we restricted ourselves to sequences with a length between 3 and 30 words.
(Ignore the error, if any)
!pip install datasets
!pip3 install apache-beam
(pip output omitted: datasets-2.12.0 and apache-beam-2.48.0 are installed successfully; pip reports a dill version conflict with multiprocess, which can be ignored.)
from random import Random
# Instantiate the Random instance with random seed = 42 to ensure reproducibility
randomizer = Random(42)
from keras.preprocessing.text import Tokenizer
from keras.utils import to_categorical, pad_sequences
import numpy as np
import pickle
import gdown
import random
from datasets import load_dataset
dataset = load_dataset("wikipedia", "20220301.simple")
data = dataset['train'][:20000]['text']
Downloading and preparing dataset wikipedia/20220301.simple to /root/.cache/huggingface/datasets/wikipedia/20220301.simple/2.0.0/aa542ed919df55cc5d3347f42dd4521d05ca68751f50dbc32bae2a7f1e167559...
Dataset wikipedia downloaded and prepared to /root/.cache/huggingface/datasets/wikipedia/20220301.simple/2.0.0/aa542ed919df55cc5d3347f42dd4521d05ca68751f50dbc32bae2a7f1e167559. Subsequent calls will reuse this data.
#run this cell only the first time to create and save the tokenizer and the data
dump = True
tokenizer = Tokenizer(split=' ', filters='!"#$%&()*+,-./:;=?@[\\]^_`{|}~\t\n', num_words=10000, oov_token='<unk>')
corpus = []
# Split each piece of text into sentences
for elem in data:
    corpus += elem.lower().replace("\n", "").split(".")
print("corpus dim: ",len(corpus))
#add a start and an end token
corpus = ['<start> '+s+' <end>' for s in corpus]
# Tokenization
tokenizer.fit_on_texts(corpus)
#print(tokenizer.word_index['<unk>'])
if dump:
    with open('tokenizer.pickle', 'wb') as handle:
        pickle.dump(tokenizer, handle, protocol=pickle.HIGHEST_PROTOCOL)
# Keep tokenized sentences of 5 to 32 tokens (including <start> and <end>) that contain no <unk> (index 1)
original_data = [sen for sen in tokenizer.texts_to_sequences(corpus) if (len(sen) <= 32 and len(sen) > 4 and 1 not in sen)]
if dump:
    with open('original.pickle', 'wb') as handle:
        pickle.dump(original_data, handle, protocol=pickle.HIGHEST_PROTOCOL)
print ("filtered sentences: ",len(original_data))
sos = tokenizer.word_index['<start>']
eos = tokenizer.word_index['<end>']
#print(eos)
#print(tokenizer.index_word[sos])
# Register index 0 as the padding token
tokenizer.word_index['<pad>'] = 0
tokenizer.index_word[0] = '<pad>'
corpus dim:  510023
filtered sentences:  137301
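On subsequent runs, the tokenizer and the tokenized sentences can be reloaded from the pickle files saved above instead of being rebuilt; a minimal sketch, assuming the files are in the current working directory:

import pickle

with open('tokenizer.pickle', 'rb') as handle:
    tokenizer = pickle.load(handle)
with open('original.pickle', 'rb') as handle:
    original_data = pickle.load(handle)

# Re-create the special-token indices exactly as in the cell above
sos = tokenizer.word_index['<start>']
eos = tokenizer.word_index['<end>']
tokenizer.word_index['<pad>'] = 0
tokenizer.index_word[0] = '<pad>'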
We now create two additional datasets.
- shuffled_data contains scrambled sequences and will be the input to the model.
- target_data is the same as original_data but offset by one timestep. It is only useful if you plan to do some language modeling with a teacher-forcing technique; you may decide to ignore it.
shuffled_data = [random.sample(s[1:-1],len(s)-2) for s in original_data]
shuffled_data = [[sos]+s+[eos] for s in shuffled_data]
target_data = [s[1:] for s in original_data]
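As a quick optional sanity check, each shuffled sequence should be a permutation of the corresponding original sentence, with the <start> and <end> tokens left in place:

# Optional sanity check on a sample of the data
for o, s in zip(original_data[:1000], shuffled_data[:1000]):
    assert s[0] == sos and s[-1] == eos
    assert sorted(o) == sorted(s)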
Let us look at some examples:
i = np.random.randint(len(original_data))
print("original sentence: ",original_data[i])
print("shuffled sentecen: ",shuffled_data[i])
original sentence: [2, 1442, 10, 4, 3380, 2339, 6, 4342, 51, 9, 5734, 2778, 3] shuffled sentecen: [2, 2778, 2339, 9, 4, 5734, 10, 1442, 4342, 6, 3380, 51, 3]
Let us look at detokenized data:
i = np.random.randint(len(original_data))
print("original sentence: ",tokenizer.sequences_to_texts([original_data[i]])[0])
print("shuffled sentence: ",tokenizer.sequences_to_texts([shuffled_data[i]])[0])
original sentence:  <start> victoria married her first cousin prince albert in 1840 <end>
shuffled sentence:  <start> victoria cousin albert 1840 married her prince in first <end>
Your goal is to reconstruct the original sentence from the shuffled one.
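If you opt for the autoregressive approach mentioned at the top, decoding can be organized as a greedy loop. The sketch below assumes a hypothetical Keras model (not defined here) that takes the shuffled sequence together with the partial output generated so far and returns a probability distribution over the next token; the exact call must be adapted to your own architecture.

import numpy as np

def greedy_decode(model, shuffled_seq, sos, eos, max_len=32):
    # `model` is a hypothetical network: it receives the shuffled sequence and the
    # partial output, and returns next-token probabilities over the vocabulary
    output = [sos]
    while len(output) < max_len:
        probs = model.predict([np.array([shuffled_seq]), np.array([output])], verbose=0)
        next_token = int(np.argmax(probs[0, -1]))  # assumes a per-position softmax head
        output.append(next_token)
        if next_token == eos:
            break
    return output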
Additional material
Here we provide a few additional functions that could be useful to you.
As usual, you are supposed to split your data into a training and a test set. Reserve at least 30% of the data for testing.
You are likely to need a validation set too.
from sklearn.model_selection import train_test_split
x_train, x_test, c_train, c_test, y_train, y_test = train_test_split(original_data, shuffled_data, target_data, test_size = 0.3, random_state = 42)
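If you need a validation set, one possible sketch is to carve it out of the training portion with a second call to train_test_split (the 10% fraction below is an arbitrary choice):

# Carve a validation split out of the training data; remember that these arrays
# also need the same padding applied below to the other splits
x_train, x_val, c_train, c_val, y_train, y_val = train_test_split(x_train, c_train, y_train, test_size = 0.1, random_state = 42)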
Depending on the model you plan to build, you might need to pad the input sequences.
max_sequence_len = max([len(x) for x in original_data])
x_train = pad_sequences(x_train, maxlen=max_sequence_len, padding='post')
x_test = pad_sequences(x_test, maxlen=max_sequence_len, padding='post')
c_train = pad_sequences(c_train, maxlen=max_sequence_len, padding='post')
c_test = pad_sequences(c_test, maxlen=max_sequence_len, padding='post')
y_train = pad_sequences(y_train, maxlen=max_sequence_len, padding='post')
y_test = pad_sequences(y_test, maxlen=max_sequence_len, padding='post')
print("x_train size:", len(x_train))
assert(len(x_train)==len(c_train)==len(y_train))
x_train size: 96110
Let us finally have a look at the distribution of the data with respect to their length.
import matplotlib.pyplot as plt
plt.hist([len(x)-2 for x in original_data],27)
(array([ 3897., 5516., 6180., 7633., 10474., 11260., 11167., 10501., 9768., 8942., 7828., 7010., 6126., 5236., 4551., 3922., 3260., 2695., 2306., 1922., 1611., 1299., 1126., 827., 773., 586., 885.]), array([ 3., 4., 5., 6., 7., 8., 9., 10., 11., 12., 13., 14., 15., 16., 17., 18., 19., 20., 21., 22., 23., 24., 25., 26., 27., 28., 29., 30.]), <BarContainer object of 27 artists>)
Metrics
Let s be the source string and p your prediction. The quality of the results will be measured according to the following metric:
- look for the longest common substring w of s and p
- compute |w|/|s|
If the match is exact, the score is 1.
When computing the score, you should NOT consider the start and end tokens.
The longest common substring can be computed with the SequenceMatcher class of difflib, which allows a simple definition of our metric.
from difflib import SequenceMatcher
def score(s, p):
    match = SequenceMatcher(None, s, p).find_longest_match()
    #print(match.size)
    return match.size / len(p)
Let's do an example.
original = "at first henry wanted to be friends with the king of france"
generated = "henry wanted to be friends with king of france at the first"
print("your score is ",score(original,generated))
your score is 0.5423728813559322
The score must be computed as an average over at least 3K random examples taken from the test set.
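A possible sketch of this evaluation loop, assuming a hypothetical reconstruct function that maps a (padded) shuffled sequence to a predicted token sequence; replace it with your model's inference routine. Start, end and padding tokens are stripped before scoring.

def detokenize(seq):
    # Drop <pad>, <start> and <end>, then map indices back to words
    tokens = [t for t in seq if t not in (0, sos, eos)]
    return tokenizer.sequences_to_texts([tokens])[0]

n_eval = 3000
idx = np.random.choice(len(x_test), n_eval, replace=False)
scores = []
for i in idx:
    s = detokenize(x_test[i])               # ground-truth sentence
    p = detokenize(reconstruct(c_test[i]))  # `reconstruct` is your (hypothetical) inference routine
    scores.append(score(s, p))
print("average score:", np.mean(scores))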
What to deliver
You are supposed to deliver a single notebook, suitably commented. The notebook should describe a single model, although you may briefly discuss additional attempts you made.
The notebook should contain a full trace of the training. Weights should be made available on request.
You must also give a clear assessment of the performance of the model, computed with the metric that has been given to you.