Introduction
Analyzing human language is one of the most significant applications of computer science in both business and higher education. Through natural language processing, computers can now read text, hear speech, and interpret what they mean. In this article, we'll explain NLP and demonstrate it using a well-known machine learning framework.
What is NLP?
Natural language processing, or NLP, is a computer program's ability to understand natural language, that is, human language as it is spoken and written.
It is a branch of computer science and artificial intelligence (AI) that deals with how computers interpret human language.
What is natural language processing used for?
Uses for NLP include:
- Examining content in emails, conversations, or social media,
- Text summarization, which condenses the meaning of documents and information and reduces duplication across data gathered from many sources,
- Sentiment analysis, which identifies overarching emotions or personal opinions and is well suited to opinion mining,
- Conversion of text to speech and speech to text.
Natural language processing categories
There are two main categories of NLP:
Natural language understanding (NLU): the NLP task of extracting insights from natural language inputs.
Natural language generation (NLG): the NLP task of building coherent natural language sentences from structured data.
Use case
In this tutorial, we'll develop a model that generates new lyrics in the style of Adele's songs.
In this use case we will use:
- Dataset: Song Lyrics
- Model: Recurrent Neural Network (RNN)
- Python Framework: TensorFlow
The main steps to follow:
- Import the appropriate libraries
- Building the Word Vocabulary
- Preprocessing the Dataset
- Build, compile and train the model
- Generating lyrics
Import the appropriate libraries
import tensorflow as tf
import numpy as np
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.layers import Embedding, LSTM, Dense, Bidirectional
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.models import Sequential
from tensorflow.keras.optimizers import Adam
import matplotlib.pyplot as plt
Building the Word Vocabulary
Load the dataset
# read the lyrics file and decode it as a single UTF-8 string
text = open("Put file path here!!", 'rb').read().decode(encoding='utf-8')
In this step, we first convert the text to lowercase and split it into lines.
Since a computer can only work with numerical values, we then build a word index dictionary with the tokenizer's fit_on_texts method, which maps every word in the corpus to an integer.
corpus = text.lower().split("\n")

tokenizer = Tokenizer()
tokenizer.fit_on_texts(corpus)

# add 1 because index 0 is reserved for padding
total_words = len(tokenizer.word_index) + 1

print('word index dictionary:', tokenizer.word_index)
print('total words:', total_words)
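To see what the tokenizer actually produces, you can run a single line through texts_to_sequences. The lyric line below is only a placeholder, and the printed indices depend entirely on the vocabulary fitted on your dataset:
# illustrative only: the sample line is hypothetical and the indices depend on the fitted vocabulary
sample_line = "hello it's me"
print(tokenizer.texts_to_sequences([sample_line])[0])
# e.g. [52, 7, 13] -- one integer per word; words missing from the vocabulary are simply dropped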
Preprocessing the Dataset
# build n-gram sequences: for each line, every prefix of length >= 2
input_seq = []
for line in corpus:
    token_list = tokenizer.texts_to_sequences([line])[0]
    for i in range(1, len(token_list)):
        sequence_gram = token_list[:i+1]
        input_seq.append(sequence_gram)

# pad every sequence to the same length (padding on the left)
max_se_len = max([len(x) for x in input_seq])
input_seq = np.array(pad_sequences(input_seq, maxlen=max_se_len, padding='pre'))
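As a quick sanity check, here is how the n-grams of a single tokenized line look after pre-padding. The token indices and the padded length are made up for illustration; in practice they come from your tokenizer and from max_se_len:
# illustrative n-grams from one line tokenized as [4, 21, 9, 33]
demo = [[4, 21], [4, 21, 9], [4, 21, 9, 33]]
print(pad_sequences(demo, maxlen=6, padding='pre'))
# [[ 0  0  0  0  4 21]
#  [ 0  0  0  4 21  9]
#  [ 0  0  4 21  9 33]]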
How to generate lyrics?
- Enter the seed lyrics and feed them to the recurrent neural network, which predicts one word to follow the seed.
  Input: 'love' --> Output: 'to'
  Input: 'difference between' --> Output: 'us'
- Append the predicted word to the seed lyrics.
  New input: 'love to'
  New input: 'difference between us'
- Repeat the steps above as many times as you like, feeding the neural network the extended seed lyrics each time.
  'love' --> 'love to' --> 'love to me' --> ... --> '..off the lights'
  'difference between' --> 'difference between us' --> ... --> '... i hear my words'
Before training, each padded sequence is split into predictors (all tokens but the last) and a label (the last token):
# split each sequence into predictors and label
xs, labels = input_seq[:, :-1], input_seq[:, -1]
# one-hot encode the labels over the whole vocabulary
ys = tf.keras.utils.to_categorical(labels, num_classes=total_words)
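Continuing the illustrative example above, each padded row is split into its first max_se_len-1 tokens (the predictors) and its last token (the label):
# illustrative values only (see the padded example above)
# padded row: [0, 0, 4, 21, 9, 33]
# xs row:     [0, 0, 4, 21, 9]   -> fed to the network
# label:      33                 -> one-hot encoded into a vector of length total_words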
Build, compile and train the model
# hyperparameters
embed_dim = 100
lstm_units = 150
learning_rate = 0.01

model = Sequential([
    Embedding(total_words, embed_dim, input_length=max_se_len-1),
    Bidirectional(LSTM(lstm_units)),
    Dense(total_words, activation='softmax')
])

# compile the model
model.compile(loss="categorical_crossentropy",
              optimizer=Adam(learning_rate=learning_rate),
              metrics=['accuracy'])

model.summary()
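As a quick check, the softmax layer should output one probability per word in the vocabulary, so the model's output shape is expected to be (None, total_words):
# one probability per vocabulary word
print(model.output_shape)   # expected: (None, total_words)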
Train the model and plot the training accuracy:
history = model.fit(xs, ys, epochs=20)

def plot(history, string):
    # plot the requested metric against the epoch number
    plt.plot(history.history[string])
    plt.xlabel("Epochs")
    plt.ylabel(string)
    plt.show()

plot(history, 'accuracy')
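Because the plot helper takes the metric name as a parameter, the same function can be reused to look at the training loss:
plot(history, 'loss')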
Generating lyrics
text_input = input('First words: ')
nb_words = int(input('Number of words: '))

for _ in range(nb_words):
    # convert the current seed text into a padded sequence
    sequence = tokenizer.texts_to_sequences([text_input])[0]
    padded = np.array(pad_sequences([sequence], maxlen=max_se_len-1, padding='pre'))
    # predict the index of the most probable next word
    pred = model.predict(padded)
    prediction = np.argmax(pred, axis=-1)[0]
    if prediction != 0:
        output = tokenizer.index_word[prediction]
        text_input += " " + output

print('Result : ', text_input)
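If you want to reuse the generation loop outside of an interactive prompt, you can wrap it in a small helper. The function name generate_lyrics below is just a suggestion and is not part of the original code; it uses the same greedy decoding (always picking the most probable next word):
def generate_lyrics(seed_text, n_words):
    # greedy decoding: repeatedly append the most probable next word
    for _ in range(n_words):
        sequence = tokenizer.texts_to_sequences([seed_text])[0]
        padded = np.array(pad_sequences([sequence], maxlen=max_se_len-1, padding='pre'))
        prediction = np.argmax(model.predict(padded), axis=-1)[0]
        if prediction != 0:
            seed_text += " " + tokenizer.index_word[prediction]
    return seed_text

print(generate_lyrics('love', 10))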