MoodyWorld
This technical documentation provides a comprehensive guide for the MoodyWorld project, detailing its architecture, functionalities, and the underlying technologies employed.
Technical Documentation
This documentation describes the project's data flow: news data is ingested from the News API, stored in MongoDB, and then transferred to BigQuery for advanced analytics. It covers the SQL transformations used to structure, filter, and prepare the data for machine learning or analytical purposes, including the tables that split the data into training and test sets. It also explains how the project's components interact, to support collaboration, maintenance, and future enhancements.
Data fetch
This section covers the end-to-end data fetch process: global news retrieval from the News API, the MongoDB connection and storage, and the main application script. It also covers the interactive script used to tag each news item for later analysis.
1. NEWS API
This script defines an asynchronous function `getWorldNews` that fetches up to 100 top headlines in English from the News API and maps each article into a new schema with the title, description, source, and publication date, plus an empty `tag` field that is filled in later. The articles are collected in an array and returned. If an error occurs, the function logs the error message and returns nothing. The function is exported for external use.
apinews.js:
// Define an asynchronous function to fetch top headlines from the News API
async function getWorldNews() {
  const newsArray = []; // Initialize an array to store news articles
  try {
    // Fetch top headlines in English with a limit of 100 articles
    const response = await fetch('https://newsapi.org/v2/top-headlines?language=en&apiKey=c8bb7c8b03bf4b528b528265a036c9dd&pageSize=100');
    const data = await response.json(); // Parse the response JSON
    const newsList = data.articles; // Extract articles from data
    // Map each article into a structured schema and add to newsArray
    newsList.forEach((e) => {
      const newSchema = {
        title: e.title || '',
        description: e.description || '',
        source: {
          id: e.source.id || '',
          name: e.source.name || ''
        },
        publishedAt: e.publishedAt || '',
        tag: '' // Left empty here; set later by the tagging script
      };
      newsArray.push(newSchema);
    });
    return newsArray; // Return the formatted news articles
  } catch (error) {
    // Log error if the fetch fails
    console.error('An error occurred while getting the news:', error.message);
  }
}
// Export the getWorldNews function for external use
module.exports = { getWorldNews };
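The mapping step can also be looked at in isolation. The following is a minimal Python sketch, not part of the project's code: `normalize_article` is a hypothetical helper name and the sample article is invented. It applies the same defaults as `getWorldNews`, that is, empty strings for missing fields and an empty `tag`.

```python
from typing import Any, Dict


def normalize_article(article: Dict[str, Any]) -> Dict[str, Any]:
    """Map a raw News API article onto the schema stored in MongoDB.

    Hypothetical helper that mirrors the mapping in apinews.js.
    """
    source = article.get("source") or {}
    return {
        "title": article.get("title") or "",
        "description": article.get("description") or "",
        "source": {
            "id": source.get("id") or "",
            "name": source.get("name") or "",
        },
        "publishedAt": article.get("publishedAt") or "",
        "tag": "",  # left empty here; set later by the tagging script
    }


# Invented sample input with a null description and a null source id.
raw = {
    "title": "Example headline",
    "description": None,
    "source": {"id": None, "name": "Example Source"},
    "publishedAt": "2024-03-29T10:00:00Z",
}
normalized = normalize_article(raw)
```

Unlike the JavaScript listing, this sketch also tolerates a missing `source` object; that is a design choice of the sketch, not a documented property of the API response.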
2. DATABASE CONNECTION
This script loads environment variables, creates a MongoDB client, and defines a function `run` that inserts an array of news documents into the ‘News’ collection of the ‘WorldNews’ database. The insert uses `ordered: false`, so MongoDB continues with the remaining documents if one of them fails. The `finally` block closes the database connection whether or not the insert succeeds. The `run` function is exported for external use.
database.js:
// Load environment variables from .env file
require('dotenv').config();
// Import MongoDB client and API version management
const { MongoClient, ServerApiVersion } = require('mongodb');
// Construct MongoDB connection URI using environment variables
const uri = `mongodb+srv://${process.env.MONGO_USER}:${process.env.PASSWORD}@test.disdl6x.mongodb.net/?retryWrites=true&w=majority&appName=Test`;
// Define database and collection names
const dbName = "WorldNews";
const collection = "News";
// Initialize MongoDB client with server API settings
const client = new MongoClient(uri, {
  serverApi: {
    version: ServerApiVersion.v1,
    strict: true,
    deprecationErrors: true,
  }
});
// Define an asynchronous function to insert news data into the database
async function run(newSchema) {
  try {
    await client.connect(); // Connect to MongoDB
    console.log("Successful connection to MongoDB");
    console.log(newSchema); // Log the documents to be inserted
    const db = client.db(dbName); // Access the database
    const usuariosCollection = db.collection(collection); // Access the collection
    // Insert the documents; with ordered: false, one failed document does not stop the rest
    await usuariosCollection.insertMany(newSchema, { ordered: false });
  } finally {
    await client.close(); // Ensure the client is closed after operation
  }
}
// Export the run function for use in other modules
module.exports = { run };
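One detail worth illustrating is the connection URI, which interpolates the user name and password directly. Credentials that contain characters such as `@`, `:` or `/` need to be percent-encoded before they are placed in a MongoDB URI. The sketch below shows the idea in Python; `build_mongo_uri` is a hypothetical helper, the credentials are invented, and only the host name is taken from the listing above.

```python
from urllib.parse import quote_plus


def build_mongo_uri(user: str, password: str, host: str) -> str:
    """Build a mongodb+srv URI with percent-encoded credentials."""
    return (
        f"mongodb+srv://{quote_plus(user)}:{quote_plus(password)}"
        f"@{host}/?retryWrites=true&w=majority"
    )


# Invented credentials; the host is the one used in database.js.
uri = build_mongo_uri("news_user", "p@ss:word/1", "test.disdl6x.mongodb.net")
```

In the Node.js script, the equivalent step would be wrapping each credential in `encodeURIComponent` before interpolating it.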
3. MAIN APP
This script defines two asynchronous functions. `mainFunction` calls `getWorldNews` and returns the fetched articles, logging any error raised while fetching. `saveNews` calls `mainFunction`, passes the result to `run` to store it in MongoDB, and logs its progress and any error raised while saving. The script ends by calling `saveNews`.
index.js:
// Import the 'run' function from the database module
const { run } = require("./database");
// Import the 'getWorldNews' function from the apinews module
const { getWorldNews } = require("./apinews");
// Define an asynchronous function to fetch world news
async function mainFunction() {
  try {
    const news = await getWorldNews(); // Fetch world news
    return news; // Return fetched news
  } catch (error) {
    console.error('Error:', error); // Log errors if fetching fails
  }
}
// Define an asynchronous function to save fetched news into the database
async function saveNews() {
  try {
    const result = await mainFunction(); // Fetch the news using mainFunction
    // Skip the insert when the fetch failed or returned no articles
    if (!result || result.length === 0) {
      console.log("No news fetched, nothing to save.");
      return;
    }
    console.log("News fetched, saving to database..."); // Log status
    await run(result); // Save the fetched news to the database using 'run' function
    console.log("News saved to the database successfully."); // Log success message
  } catch (error) {
    console.error('Error saving news to the database:', error); // Log any errors encountered during saving
  }
}
saveNews(); // Execute the function to fetch and save news
4. TAGGING APP
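This script walks through every news document in MongoDB whose `tag` is still empty. For each one it prints the current document count per tag, then either offers to delete the document (when its title is ‘[Removed]’) or shows the title and description and asks the user for one of seven emotion tags. The chosen tag is written back to the document. Documents tagged ‘Neutral’ can be deleted immediately after tagging, based on user input.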
tagging.py:
# Import necessary modules and functions
import os

import pymongo
from bson import ObjectId  # For working with MongoDB object IDs
from dotenv import load_dotenv

load_dotenv('/Users/mariana/Documents/Dev/GitHub/API-news/.env')

# Connect to the MongoDB database using environment variables
myclient = pymongo.MongoClient(f'mongodb+srv://{os.getenv("MONGO_USER")}:{os.getenv("PASSWORD")}@test.disdl6x.mongodb.net/?retryWrites=true&w=majority')
mydb = myclient['WorldNews']
mycollection = mydb['News']
emotions = ["Happiness", "Sadness", "Anger", "Fear", "Surprise", "Disgust", "Neutral"]

# Process each document without a tag
for e in mycollection.find({'tag': ''}):
    newsid = e['_id']  # Extract the MongoDB ObjectID
    newstitle = e['title']
    newsdescription = e['description']

    # Display document count by tags
    print("Document count")
    print('')
    total = 0
    for c in mycollection.aggregate([
        {'$match': {'tag': {'$exists': True}}},
        {'$group': {'_id': '$tag', 'count': {'$sum': 1}}},
        {'$sort': {'_id': 1}}
    ]):
        tag_temp = 'Empty' if c['_id'] == '' else c['_id']
        total += c['count']
        print(f"{tag_temp + ' has: ':20s} {c['count']}")
    print(f"Total documents: {total}")
    print('')

    # Handle news items with removed titles
    if newstitle == '[Removed]':
        print("The title is empty. Do you want to delete it? Answer Y or N")
        response = input()
        if response.lower().strip() == 'y':
            mycollection.delete_one({'_id': ObjectId(newsid)})
            print(f"The news with _id: {newsid} has been deleted")
        else:
            print("Okay, then deal with it.")
        print("")
    # Tag news items with valid titles
    else:
        print(newstitle)
        print(newsdescription)
        print('Add the tag here: ', end="")
        tag = input().strip().capitalize()
        # Keep asking until the tag is one of the allowed emotions
        while tag not in emotions:
            print(f"{tag} is invalid. Choose from the following options: {emotions}")
            print('')
            print("Add the new tag here: ", end="")
            tag = input().strip().capitalize()

        # Update the document with a valid tag
        print('')
        print(newsid)
        print(f'HERE IS THE NEW TAG: {tag}')
        print('')
        response = mycollection.update_one({'_id': ObjectId(newsid)}, {'$set': {'tag': str(tag)}})

        # Neutral items can be dropped right after tagging
        if tag == 'Neutral':
            count = mycollection.count_documents({'tag': 'Neutral'})
            print(f'There are {count} documents with the tag {tag}')
            print('Do you want to keep it? Y/N ', end="")
            res = input('').lower().strip()
            if res == 'n':
                mycollection.delete_one({'_id': ObjectId(newsid)})
                print('')
                print(f'The news you just tagged as {tag} has been deleted from the DB')
                print('')
            else:
                print('This news has been added to your DB =)')

        # Log the result of the update
        if response.matched_count == 0:
            print("No document found with the given ID.")
        else:
            print("Document updated successfully.")
        print("")
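The tag validation step can be isolated into a small function, which makes it possible to exercise it without a database or a terminal. This is a minimal sketch under stated assumptions: `ask_for_tag` and its injectable `read` parameter are hypothetical names, not part of tagging.py. Only the list of emotions and the `strip().capitalize()` normalization come from the script.

```python
from typing import Callable, List

EMOTIONS: List[str] = [
    "Happiness", "Sadness", "Anger", "Fear", "Surprise", "Disgust", "Neutral",
]


def ask_for_tag(read: Callable[[], str] = input) -> str:
    """Keep reading answers until one matches an allowed emotion."""
    while True:
        tag = read().strip().capitalize()
        if tag in EMOTIONS:
            return tag
        print(f"{tag} is invalid. Choose from the following options: {EMOTIONS}")


# Simulated user input: two invalid answers, then a valid one.
answers = iter(["joy", "  ANGRY ", " sadness "])
chosen = ask_for_tag(lambda: next(answers))
```

Passing the reader as a parameter is only a testing convenience; in interactive use the default `input` behaves like the prompts in the script.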
Data architecture
This section outlines the data architecture of the MoodyWorld project, describing the import of data from MongoDB to BigQuery and detailing subsequent SQL operations and model creation. It highlights the use of BigQuery for data preparation and the development of a multiclass logistic regression model to classify news titles.
1. Dataflow
Data is imported from MongoDB using the MongoDB to BigQuery template:
Pipeline requirements:
- The target BigQuery dataset must exist.
- The source MongoDB instance must be accessible from the Dataflow worker machines.
Output format:
[
{"name":"id","type":"STRING"},
{"name":"source_data","type":"STRING"},
{"name":"timestamp","type":"TIMESTAMP"}
]
- The `source_data` field contains the document in JSON format.
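To make the output format concrete, the sketch below builds one row with the three columns listed above from a sample document. It is an illustration only, not the template's code: `to_bigquery_row` is a hypothetical function, the sample document is invented, and how the template actually derives `id` and `timestamp` is an assumption here.

```python
import json
from datetime import datetime, timezone
from typing import Any, Dict


def to_bigquery_row(document: Dict[str, Any]) -> Dict[str, str]:
    """Illustrate the three-column output: id, source_data, timestamp."""
    return {
        # Assumption: the row id is the string form of the MongoDB _id.
        "id": str(document["_id"]),
        # The whole document is serialized into a single JSON string column.
        "source_data": json.dumps(document, default=str),
        # Assumption: the timestamp records the load time in UTC.
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }


# Invented sample document shaped like the ones written by database.js.
row = to_bigquery_row({
    "_id": "66069f0c2f8b4c0012345678",
    "title": "Example headline",
    "description": "",
    "source": {"id": "", "name": "Example Source"},
    "publishedAt": "2024-03-29T10:00:00Z",
    "tag": "Fear",
})
```

Because every field ends up inside `source_data`, the individual columns have to be extracted again on the BigQuery side.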
Required parameters:
- MongoDbURI
- Database
- Collection
2. BigQuery
- Split the table
This query is used to structure and transform nested JSON data into a flat table format for easier querying and analysis in BigQuery.
CREATE OR REPLACE TABLE `moody-app-dd775.moodyappdataset29032024.splittable` AS
SELECT
  id,
  JSON_EXTRACT(source_data, '$.title') AS _title,
  JSON_EXTRACT(source_data, '$.description') AS _description,
  JSON_EXTRACT(source_data, '$.tag') AS _tag,
  JSON_EXTRACT(source_data, '$.publishedAt') AS _publishedAt,
  JSON_EXTRACT(source_data, '$.source') AS _source
FROM `moody-app-dd775.moodyappdataset29032024.modyapptable29032024`;
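The effect of the query on a single row can be approximated in Python. This is a sketch for intuition, assuming that `JSON_EXTRACT` returns each value as JSON-formatted text (so string values keep their surrounding double quotes) and yields NULL for a missing or null field. `split_source_data` is a hypothetical name and the sample row is invented.

```python
import json
from typing import Dict, Optional

FIELDS = ["title", "description", "tag", "publishedAt", "source"]


def split_source_data(row_id: str, source_data: str) -> Dict[str, Optional[str]]:
    """Flatten one row; each extracted value is kept as JSON text."""
    document = json.loads(source_data)
    flat: Dict[str, Optional[str]] = {"id": row_id}
    for field in FIELDS:
        value = document.get(field)
        # Missing or null fields become None (SQL NULL); other values stay
        # JSON-encoded, so strings keep their quotes.
        flat[f"_{field}"] = None if value is None else json.dumps(value)
    return flat


flat = split_source_data(
    "abc123",
    '{"title": "Example headline", "tag": "Fear", "source": {"name": "Example Source"}}',
)
```

If unquoted strings are needed downstream, BigQuery's `JSON_EXTRACT_SCALAR` is the usual alternative for scalar fields such as the title and the tag.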
- Filtering Parameters
This query retains only the columns `_title`, `_description`, `_publishedAt`, and `_tag` from the `splittable` table, creating a simpler table that holds only the information required in the following steps.
- Split the training and test batches
The purpose of this query is to prepare the data for machine learning or statistical modeling by creating a random split of the data into training and testing sets. This is a common practice in data science to evaluate the performance of a model on unseen data.
It transforms the data from the `filteringparameters` table by adding a new column that assigns each row to either a training set or a test set based on a random selection.
`CASE WHEN RAND() < 0.8 THEN 'train' ELSE 'test' END AS batch`: This line introduces a new column named `batch`. For each row, it generates a random number between 0 and 1 using `RAND()`. If the number is less than 0.8, the row is labeled as ‘train’; otherwise, it is labeled as ‘test’. This effectively splits the data into training and testing sets, with approximately 80% of the data allocated to training and the remaining 20% to testing.
FROM `moody-app-dd775.moodyappdataset29032024.filteringparameters`
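The same labeling rule can be sketched in Python to show why the split is only approximately 80/20. The function name `assign_batches` and the fixed seed are assumptions made for this sketch. The seed only makes the example reproducible; `RAND()` in BigQuery takes no seed, so re-running the query would be expected to produce a different assignment.

```python
import random
from typing import List


def assign_batches(n_rows: int, train_fraction: float = 0.8, seed: int = 42) -> List[str]:
    """Label each row 'train' or 'test' with an independent random draw."""
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    return [
        "train" if rng.random() < train_fraction else "test"
        for _ in range(n_rows)
    ]


batches = assign_batches(10_000)
train_share = batches.count("train") / len(batches)
```

Each row is drawn independently, so `train_share` lands close to 0.8 rather than exactly on it, and the gap tends to be larger for small tables.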
- Multiclass model (logistic regression), work in progress