Time Series Analysis using Spark

June 29th 2017, 10:30 am | Category: Big Data | 0 comments

A time series is a sequence of data points ordered by time. Time series analysis is a methodology for extracting useful and meaningful information from these data points. Scientifically oriented languages such as R support different time series forecasting models such as ARIMA, HoltWinters and ETS. In an earlier post, we demonstrated using these models to forecast sales from Salesforce (SFDC) data. A major drawback of R is that it is single-threaded, which limits its scalability. In this post, we will explore other options: Python, Spark on Scala, and PySpark. For brevity, we will focus only on the ARIMA model.

Sales Forecasting using Python

Unlike R, Python does not support automatic detection of the best ARIMA model. Further, its ARIMA implementation predicts a single data point at a time. The following code handles these limitations:
import warnings

import numpy as np
from statsmodels.tsa.arima_model import ARIMA

# MAX_P, MAX_D and MAX_Q are the exclusive upper bounds of the parameter search;
# they are defined elsewhere and their values are not shown here.


def predict_ARIMA_AUTO(amounts, period):
    warnings.filterwarnings("ignore")  # specify to ignore warning messages
    best_p, best_d, best_q = 0, 1, 0
    best_aic = float("inf")
    best_history = []
    for p in range(MAX_P):
        for d in range(MAX_D):
            for q in range(MAX_Q):
                try:
                    model = ARIMA(np.asarray(amounts, dtype=np.float64), order=(p, d, q))
                    model_fit = model.fit(disp=0)
                    model_aic = model_fit.aic
                    if model_aic < best_aic:
                        # prediction: forecast one point, append it, then refit
                        history = list(amounts)
                        for t in range(0, period):
                            model = ARIMA(np.asarray(history, dtype=np.float64), order=(p, d, q))
                            model_fit = model.fit(disp=0)
                            output = model_fit.forecast()
                            # forecast() returns (forecast, stderr, conf_int); keep the point forecast
                            history.append(output[0][0])
                        best_history = history
                        best_p, best_d, best_q = p, d, q
                        best_aic = model_aic
                except Exception:
                    # skip parameter combinations for which the fit fails
                    continue
    print "ARIMA(", best_p, best_d, best_q, ")", "AIC=", best_aic
    return best_history[len(amounts):]
The code tries out different combinations of the ARIMA parameters (p, d and q) in the three nested loops and picks the best model. The best ARIMA model for a given data-set is the one with the lowest AIC value. To work around the single-point prediction, we append the predicted point to the data-set and predict again. This incremental prediction allows us to predict N points instead of the single point that the Python implementation returns by default.

The following code is similar to the R code illustrated before, which forecasts the SFDC data:
from datetime import datetime
from dateutil.relativedelta import relativedelta
from util import *
period = 6                  # months
# retrieve data
sparkSession = getSparkSession()
data = loadData(sparkSession)
amounts = data.rdd.map(lambda row: row.Amount).collect()
series = data.rdd.map(lambda row: row.CloseDate).collect()
# train and check accuracy
trainingSize = int(0.75 * len(amounts))
checkingSize = len(amounts) - trainingSize
trainingData = amounts[0:trainingSize]
checkingData = amounts[trainingSize:]
checkingPredicted = predictor(trainingData, checkingSize)
squareError = 0
for i in range(0, checkingSize):
    squareError += (checkingPredicted[i] - checkingData[i])**2
    print int(checkingPredicted[i]), "should be", checkingData[i]
# note: this is the root of the summed squared errors (it is not divided by checkingSize)
rse = squareError**(1/2.0)
print "Root Square Error", int(rse)
# prediction
predicted = predictor(amounts, period)
month = datetime.strptime(series[0], '%Y-%m')
d = []
for i in amounts:
    d.append((month, i))
    month += relativedelta(months=1)
for i in predicted:
    d.append((month, long(i.item())))
    month += relativedelta(months=1)
df = sparkSession.createDataFrame(d, ['date', 'amount'])
The code follows the same logic described before. Here we encapsulated loading the data from the database and saving the results in a custom util package. Training on 75% of the data selects an ARIMA model with the following parameters, with a root square error (the square root of the summed squared errors, as computed in the code) of 1500959:
ARIMA( 0 1 0 ) AIC= 687.049546173
The final model, fitted on the full data-set, has the following parameters:
ARIMA( 2 1 0 ) AIC= 967.844839706
and it gives the following predictions:
Sep 2016  1361360
Oct 2016  1095653
Nov 2016  1268538
Dec 2016  1416972
Jan 2017  1301205
Feb 2017  1334660

Sales Forecasting using Spark

Both R and Python are single-threaded, which is not suitable for processing (or loading) large data. Previously, we counted on the DB to retrieve, aggregate and sort the data. Here, we will make use of the power of Spark to handle this. SparkSQL provides an SQL interface that converts SQL queries into Spark jobs. Additionally, a third-party implementation of ARIMA is available at https://github.com/sryza/spark-timeseries/. In this section and the next, we will use Spark to model and forecast the data.
import com.cloudera.sparkts.models.ARIMA
import org.apache.spark.mllib.linalg.DenseVector
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{Row, SparkSession}

object CloseWin {
  def forecast(): Unit = {
    val APP_NAME = "Sales Forecast"
    val period = 6
    val conf = new SparkConf().setAppName(APP_NAME).setMaster("local[2]")
    val sc = new SparkContext(conf)
    val spark = SparkSession.builder().config(conf).getOrCreate()
    val dfr = spark.read
      .format("jdbc")
      .option("url", "jdbc:mysql://my-db-domain:3306/sfdc")
      .option("user", "my-db-username")
      .option("password", "my-db-password")
    val df2 = dfr.option("dbtable","(SELECT CloseDate, Amount FROM toast.opportunity WHERE IsWon='true' AND IsClosed='true') as win").load()
    // Register the loaded rows under the table name used by the aggregation query
    df2.createOrReplaceTempView("opp")
    val df = spark.sql("SELECT DATE_FORMAT(CloseDate,'yyyy-MM') as CloseDate, SUM(Amount) as Amount FROM opp GROUP BY DATE_FORMAT(CloseDate,'yyyy-MM') ORDER BY CloseDate")
    val rows = df.collect()
    val monthes = rows.map((row: Row) => row.get(0))
    val amounts = rows.map((row: Row) => row.getLong(1).toDouble)

    // Training: fit on the first 75% of the months and check against the rest
    val trainingSize = (amounts.length * 0.75).toInt
    val checkingSize = amounts.length - trainingSize
    val trainingActual = new DenseVector(amounts.take(trainingSize))
    val trainingModel = ARIMA.autoFit(trainingActual)
    println("best-fit model ARIMA(" + trainingModel.p + "," + trainingModel.d + "," + trainingModel.q + ") AIC=" + trainingModel.approxAIC(trainingActual))
    // forecast returns the fitted history followed by the forecast points
    val trainingPredicted = trainingModel.forecast(trainingActual, checkingSize)
    var totalErrorSquare = 0.0
    for (i <- (trainingPredicted.size - checkingSize) until trainingPredicted.size) {
      val errorSquare = Math.pow(trainingPredicted(i) - amounts(i), 2)
      println(s"${monthes(i)}:\t\t${trainingPredicted(i)}\t should be \t${amounts(i)}\t Error Square = $errorSquare")
      totalErrorSquare += errorSquare
    }
    println("Root Mean Square Error: " + Math.sqrt(totalErrorSquare / checkingSize))

    // Prediction: refit on all months and forecast the next `period` months
    val actual = new DenseVector(amounts)
    val model = ARIMA.autoFit(actual)
    println("best-fit model ARIMA(" + model.p + "," + model.d + "," + model.q + ")  AIC=" + model.approxAIC(actual))
    val predicted = model.forecast(actual, period)
    for (i <- 0 until predicted.size) {
      println("Model Point #" + i + "\t:\t" + predicted(i))
    }
  }
}

Sales Forecasting using PySpark

PySpark is a Python interface for Spark. We can use the Python implementation explained before but we will change the following:

The util package, to make use of Spark:
from pyspark.sql import SparkSession
def getSparkSession():
    return SparkSession.builder.getOrCreate()
def loadData(sparkSession):
    sparkSQL = sparkSession.read.format("jdbc").option("url", "jdbc:mysql://my-db-domain:3306/sfdc").option("dbtable", "sfdc.opportunity").option("user", "my-db-username").option("password", "my-db-password").load()
    sparkSQL.createOrReplaceTempView("opp")  # register the table name used in the query below
    return sparkSession.sql("SELECT DATE_FORMAT(CloseDate,'yyyy-MM') as CloseDate, SUM(Amount) as Amount FROM opp WHERE IsWon='true' AND IsClosed='true' GROUP BY DATE_FORMAT(CloseDate,'yyyy-MM') ORDER BY CloseDate")
def saveData(result):
    pass  # the body is not shown in this post
The prediction method, to use the Spark ARIMA implementation instead of the default Python one:

import pyspark
from pyspark.mllib.common import _java2py, _py2java
from pyspark.mllib.linalg import Vectors

def predict_ARIMA_Spark(amounts, period):
    spark_context = pyspark.SparkContext.getOrCreate()
    # convert the Python list into a JVM vector once and reuse it
    jts = _py2java(spark_context, Vectors.dense(amounts))
    model = spark_context._jvm.com.cloudera.sparkts.models.ARIMA.autoFit(jts, MAX_P, MAX_D, MAX_Q)
    p = _java2py(spark_context, model.p())
    d = _java2py(spark_context, model.d())
    q = _java2py(spark_context, model.q())
    aic = model.approxAIC(jts)
    print "ARIMA(", p, d, q, ")", "AIC=", aic
    jfore = model.forecast(jts, period)
    # the forecast vector starts with the fitted history; keep only the new points
    return _java2py(spark_context, jfore)[len(amounts):]
Unfortunately, the third-party implementation is not native to Python. It simply delegates the calls to Java, using Py4J to bridge Python and Java.

Sales Forecasting using R

June 29th 2017, 9:53 am | Category: Big Data | 0 comments

Sales forecasting is the process of estimating future sales and revenue in order to enable companies to make informed business decisions and predict short-term and long-term performance. Companies can base their forecasts on past sales data, industry-wide comparisons, and economic trends. Sales forecasting can be classified as a time-series forecasting problem, because time is the domain over which the data (sales or revenue) changes.

Time Series Analysis

A time series is a sequence of data points ordered by time. Time series analysis is a methodology for extracting useful and meaningful information from these data points. Any time series can be decomposed into three components:
  • Trend: the regression of the data points on time. For example, a time series with a positive trend means that the values of the data points at time (t+n) are larger than the ones at time (t). Here the value of the data depends on time rather than on the previous values.
  • Seasonality (Cycle): the repetition of the data over the time domain. In other words, the data values at time (t+n) are the same as the data at time (t), where n is the seasonality or cycle length.
  • Noise (Random Walk): a time-independent (non-systematic) component that is added to (or subtracted from) the data points.

Based on this we can classify the time series into two classes:
  • Non-Stationary: data points whose mean, variance or covariance change over time. This is interpreted as trends, cycles or a combination of them.
  • Stationary: data points whose mean, variance and covariance do not change over time.

The theory behind non-stationary signals and their forecasting is not mature, and modeling them is complex, which leads to inaccurate results. Luckily, a non-stationary series can be transformed into a stationary one using common techniques (e.g. differencing). The idea of differencing is to subtract from each data value its predecessor, so the new series will lose a component of the time-dep
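
As a minimal sketch of this idea, the following Python snippet subtracts from each value the value that precedes it by a given lag. The function name `difference` and the sample values are illustrative only and do not come from the SFDC data. A series with a purely linear trend becomes constant after one pass.

```python
def difference(series, lag=1):
    """Return the lag-differenced series: series[t] - series[t - lag]."""
    return [series[t] - series[t - lag] for t in range(lag, len(series))]


# Illustrative data: a linear trend (slope 3) with no noise.
trend = [10, 13, 16, 19, 22]
print(difference(trend))  # [3, 3, 3, 3]

# Differencing at the season length removes a repeating pattern instead.
seasonal = [1, 5, 2, 1, 5, 2, 1, 5, 2]
print(difference(seasonal, lag=3))  # [0, 0, 0, 0, 0, 0]
```

Each pass shortens the series by `lag` points, which is the price paid for removing the time-dependent component.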

Sales-Force Forecasting

Salesforce.com (abbreviated as SF or SFDC) is a cloud computing company that provides customer relationship management (CRM) products. Salesforce.com's CRM service is broken down into several broad categories: Sales Cloud, Service Cloud, Data Cloud, Marketing Cloud, Community Cloud, Analytics Cloud, App Cloud, and IoT, with over 100,000 customers.
In this post we will analyze and forecast sample sales data from the Salesforce CRM that shows sales growth between 2013 and 2016, and we will predict the sales values for two business quarters. We will use two models for forecasting, ARIMA and HoltWinters, and will demonstrate how to do that using the R language.
The methodology that we will follow is:
  1. Aggregate the data per month
  2. Construct the model using 75% of the data as training set
  3. Check the model accuracy using 25% of the data, and calculate the root mean square error
  4. Forecast the data for the next two quarters
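
Steps 2 to 4 can be sketched in Python as follows. This only illustrates the chronological split and the error metric: the monthly amounts are made-up numbers, and the "model" is a naive placeholder that repeats the last training value, not ARIMA or HoltWinters. All function names are hypothetical.

```python
import math


def split_train_test(values, train_fraction=0.75):
    """Split a series chronologically: first part for training, rest for testing."""
    n_train = int(math.ceil(train_fraction * len(values)))
    return values[:n_train], values[n_train:]


def rmse(predicted, observed):
    """Root mean square error between two equally long sequences."""
    squared = [(p - o) ** 2 for p, o in zip(predicted, observed)]
    return math.sqrt(sum(squared) / len(squared))


def naive_forecast(train, horizon):
    """Placeholder model (assumption): repeat the last observed value."""
    return [train[-1]] * horizon


# Made-up monthly totals, already aggregated per month (step 1).
monthly = [100.0, 120.0, 130.0, 125.0, 140.0, 150.0, 145.0, 160.0]
train, test = split_train_test(monthly)
predicted = naive_forecast(train, len(test))
print(len(train), len(test))
print(rmse(predicted, test))
```

Replacing `naive_forecast` with a real model keeps the rest of the procedure unchanged, which makes it easy to compare models on the same held-out months.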
We assume the data is stored in the "opportunity" table, and we are interested in the following fields:
  • CloseDate: the date of closing the opportunity (Measure field)
  • Amount: the monetary amount obtained
  • IsWon and IsClosed: flags for whether the opportunity was won/lost and closed/open
The following plots show the Amount vs the CloseDate, and the decomposition of the time series into trend, seasonality and residuals.

Sales Forecasting using R

We need to install two packages: RJDBC for connecting to the DB and retrieving the data, and forecast for data modeling, analysis and forecasting. The following R script shows sales forecasting using ARIMA.
ARIMA model is an abbreviation for Autoregressive Integrated Moving Average, so it is a combination of multiple techniques:
  • Auto-regression (AR)
  • Integration (I)
  • Moving Average (MA)

library(RJDBC)
library(forecast)

rmse <- function(sim, obs){
  return(sqrt(mean((sim - obs)^2, na.rm = TRUE)))
}

construct_model <- function(data){
  data.start = strsplit(data$CloseDate[1], "-")
  data.end = strsplit(data$CloseDate[nrow(data)], "-")
  # CloseDate has the form "YYYY-MM", so the series starts at c(year, month)
  data.ts = ts(data$Amount, 
                start = as.numeric(data.start[[1]]),
                frequency = 12)
  model = auto.arima(data.ts)
  return(model)
}

get_forecast_model <- function(close.win.opp){
  # Train with 75% of data   
  N = ceiling(0.75*nrow(close.win.opp))
  train.data = close.win.opp[1:N,]
  model = construct_model(train.data)

  # Test with 25% of data   
  test.data = close.win.opp[(N+1):nrow(close.win.opp),]
  predicted = forecast(model, length(test.data$Amount))
  cat("RMSE=", rmse(predicted$mean, test.data$Amount), "\n")
  # Train with all data 
  model = construct_model(close.win.opp)
  return(model)
}

drv <- JDBC("com.mysql.jdbc.Driver",  classPath="./mysql-connector-java-5.1.41-bin.jar")
conn <- dbConnect(drv, "jdbc:mysql://my-db-domain:3306/sfdc", "my-db-username", "my-db-password")

close.win.opp = dbGetQuery(conn, "SELECT DATE_FORMAT(CloseDate,'%Y-%m') as CloseDate, SUM(Amount) as Amount FROM opportunity WHERE IsWon='true' AND IsClosed='true' And CloseDate < '2016-09-01' GROUP BY DATE_FORMAT(CloseDate,'%Y-%m') ORDER BY CloseDate")

model = get_forecast_model(close.win.opp)
predicted = forecast(model, 6)

The script splits the data-set into training data (75%) and verification data (25%). Next, it builds the model based on the training data. R has an implementation of the ARIMA model featuring automatic detection of the parameters. For the aforementioned SFDC dataset, we obtained the following model, with a root mean square error of 233560:
sigma^2 estimated as 4.254e+10:  log likelihood=-218.49
AIC=438.98   AICc=439.27   BIC=439.76
Next, we build a model using the whole data-set. The obtained model is:
sigma^2 estimated as 5.288e+10:  log likelihood=-343.82
AIC=691.64   AICc=692.18   BIC=694.07
Finally, we forecast the next 6 months using this model (the forecast(model, 6) call at the end of the script). The forecast is:
         Point Forecast   Lo 80   Hi 80     Lo 95   Hi 95
Sep 2016        1894667 1599973 2189362 1443970.6 2345364
Oct 2016        1520401 1201883 1838919 1033270.3 2007532
Nov 2016        1545809 1205130 1886488 1024785.6 2066833
Dec 2016        1477773 1116289 1839257  924930.8 2030616
Jan 2017        1517143 1135988 1898299  934216.2 2100070
Feb 2017        1825764 1425904 2225624 1214230.9 2437297

R supports different time series forecasting models. In the code above, you can easily change the forecasting model by changing the auto.arima call in construct_model. For example, to use the HoltWinters model, change that line to
model = HoltWinters(data.ts)
The root mean square error was 301992.9, and the predictions in this case were
         Point Forecast   Lo 80   Hi 80     Lo 95   Hi 95
Sep 2016        1760990 1507244 2014735 1372919.6 2149059
Oct 2016        1363874 1107298 1620450  971474.8 1756273
Nov 2016        1381785 1120706 1642865  982498.9 1781072
Dec 2016        1321675 1054109 1589240  912468.3 1730881
Jan 2017        1376777 1100499 1653055  954246.8 1799307
Feb 2017        1687538 1400158 1974919 1248028.3 2127049

As we see, the HoltWinters model gives a higher root mean square error on the verification data than ARIMA (301992.9 vs 233560), so by this measure ARIMA is the more appropriate model for our data.


Wuzzuf Dataset Cleaning

June 28th 2017, 5:44 am | Category: Big Data | 0 comments

Wuzzuf is a technology firm founded in 2009 and one of the very few companies in the MENA region specialized in developing innovative online recruitment solutions for top enterprises and organizations. They have successfully served 10,000+ top companies and employers in Egypt, 1.5 million CVs were viewed on their platform, and 100,000+ job seekers were directly hired through them. In total, 250,000+ open job vacancies were advertised, and now 500,000+ users visit their website each month looking for jobs at top employers.

Wuzzuf has released a sample dataset on Kaggle (which provides data science competitions, datasets, and kernels), named Wuzzuf Job Posts. The dataset contains 2 CSV files:

  • Wuzzuf_Job_Posts_Sample.csv: which contains Wuzzuf job posts with the following attributes:

    • id: post identifier 

    • city_name: is the city of the job.

    • job_title: the title of the job

    • job_category_1, job_category_2 and job_category_3: which contains the most 3 relevant categories of the job post, e.g., Sales/Retail/Business Development

    • job_industry_1, job_industry_2 and job_industry_3:  which contains the most 3 relevant industries of the job post, e.g., Telecommunications Services

    • salary_minimum and salary_maximum: the salary limits.

    • num_vacancies: how many open vacancies for this job post.

    • career_level: enumeration of career levels e.g., Experienced (Non-Manager) and Entry Level

    • experience_years: number of years of experience.

    • post_date: publication timestamp of the post. e.g., 2014-01-02 16:01:26

    • views: count of views

    • job_description: detailed description for the job post.

    • job_requirements: main job requirements for the job post.

    • payment_period: salary payment interval e.g. Per Month

    • currency: salary currency e.g. Egyptian Pound

  • Wuzzuf_Applications_Sample.csv.zip: which contains Wuzzuf job applications. It has the following attributes:

    • id : application identifier

    • user_id: applicant identifier

    • job_id: post identifier

    • app_date: application timestamp, e.g., 2014-01-01 07:27:52

Data Cleaning

The published data-set has many free-text fields: the Wuzzuf system does not enforce a fixed list of items to choose from for them, which makes processing and aggregation difficult. Common handling, such as lower-casing all values and removing trailing spaces, was performed. Additionally, some fields needed special handling:

  1. city_name:

    • This attribute is a free-text attribute that represents Egyptian cities, but it has the following problems:

      • Misspelled words (e.g. cairo, ciro, ciaro).

      • Arabic names (e.g. القاهرة)

      • Cities outside Egypt (e.g. riyadh, doha)

      • General cities (e.g. all egypt cities, any location)

      • Groups of cities (e.g. "cairo, alexandria - damanhor")

    • All the above issues have been solved as follows:

      • Cities outside Egypt: a static list of outside cities has been mapped to the category "outside".

      • Arabic (non-ASCII) names: have been replaced statically by the corresponding English words.

      • General cities: a static list of general city names has been mapped to the category "any".

      • Removed unneeded substrings such as "el" and "al".

      • Replaced "and" and "or" substrings with "-" so they are split in the next steps.

      • Groups of cities: the attribute has been split on several delimiters.

      • Misspelled words: a static list of valid cities and their states in Egypt has been created. Each misspelled word has been mapped to the most similar valid city; a threshold T is used to accept only similarities above it, otherwise the city is mapped to the "any" category.

      • Added a new state attribute by mapping each city_name to its state from the valid cities and states list.

  2. job_category_1, job_category_2, job_category_3 attributes: cleaning was done by removing the placeholder text "Select" from all 3 attributes, and merging the 3 attributes into one attribute called job_categories.

  3. job_industry_1, job_industry_2, job_industry_3 attributes: cleaning was done by removing the placeholder text "Select" from all 3 attributes, and merging the 3 attributes into one attribute called job_industries.

  4. experience_years attribute: we manually normalized the free text into one of 3 forms, 'x+', 'x-y' or 'x', and split the experience_years attribute into 2 new attributes, experience_years_min and experience_years_max, which contain the minimum and maximum years of experience, respectively, needed for a job.

  5. post_date attribute: we generated a new attribute called "post_timestamp" which has the POSIX timestamp value of the post_date attribute (i.e., the number of seconds that have elapsed since January 1, 1970 midnight UTC/GMT)

  6. job_description and job_requirements attributes: we noticed that the job_requirements attribute is normally empty, so we added a new derived attribute called "description", which contains the concatenation of the job_requirements and job_description attributes.
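
The city_name and experience_years steps above could look roughly like the following Python sketch. It is not the code used for the published data-set: the reference lists, the threshold value, the use of `difflib` for string similarity, and the choice of `None` for an open-ended maximum are all assumptions made for illustration.

```python
# -*- coding: utf-8 -*-
import difflib
import re

# Hypothetical reference data; the real static lists are not shown in this post.
VALID_CITIES = ["cairo", "alexandria", "giza", "damanhour", "mansoura"]
OUTSIDE_CITIES = {"riyadh", "doha"}
GENERAL_NAMES = {"all egypt cities", "any location"}
ARABIC_NAMES = {u"القاهرة": "cairo"}
THRESHOLD = 0.7  # assumed value for the similarity threshold T


def normalize_city(raw):
    """Map one free-text city_name value to a list of cleaned city names."""
    text = raw.strip().lower()
    text = ARABIC_NAMES.get(text, text)
    if text in GENERAL_NAMES:
        return ["any"]
    # Turn "and"/"or" into a delimiter, then split on the delimiters.
    text = re.sub(r"\b(and|or)\b", "-", text)
    cities = []
    for part in re.split(r"[,\-/]", text):
        # Drop the "el"/"al" article prefixes.
        part = re.sub(r"\b(el|al)\b", " ", part).strip()
        if not part:
            continue
        if part in OUTSIDE_CITIES:
            cities.append("outside")
            continue
        # Keep the most similar valid city only if it passes the threshold.
        match = difflib.get_close_matches(part, VALID_CITIES, n=1, cutoff=THRESHOLD)
        cities.append(match[0] if match else "any")
    return cities


def parse_experience(value):
    """Split 'x+', 'x-y' or 'x' into (experience_years_min, experience_years_max)."""
    value = value.strip()
    if value.endswith("+"):
        return int(value[:-1]), None  # open-ended upper bound (assumed encoding)
    if "-" in value:
        low, high = value.split("-")
        return int(low), int(high)
    return int(value), int(value)


print(normalize_city("cairo, alexandria - damanhor"))
print(parse_experience("2-5"))
```

A static mapping handles the cases that similarity cannot (Arabic names, cities outside Egypt), while the threshold keeps unrelated strings from being forced onto a valid city.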

Derived Attributes

The next step was deriving some attributes from these data sets. We derived the following attributes:

  1. Tags attributes: we used a third party API from MeaningCloud to extract Tags from the "description" attribute (recall that it contains the data from job_description and job_requirements). Thus, we added to the data-set the following attributes: 

    • quotation_list: which represents quoted text. e.g., you take on the responsibility of growing the Academy by increasing business and handling operational and technical challenges that arise in the process.

    • entity_list : which represents named entities as people, organization, places, etc. e.g. MS Office, Word, Excel, Weeks and Cairo

    • concept_list: which represents significant keywords. e.g., ability, system, software, code and computer science.

    • relation_list: This attribute could be used to provide a summary for the description attribute as it highlights most of the important notes from the description part.

    • money_expression_list: which represents money expressions, e.g., 2000 EGP

    • time_expression_list: which represents time expressions, e.g., 6 Months at least and 8.5 hours

    • other_expression_list: which contains other expressions such as alphanumeric patterns. e.g., php5

  2. applications_count attribute for each post, which counts how many applicants have applied to this job post (derived from the applications data-set).

  3. first_applicant_timestamp and last_applicant_timestamp attributes for each post, which hold the POSIX timestamps of the first and last applications to this job post (derived from the applications data-set).
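
The attributes derived from the applications data-set (items 2 and 3) can be sketched in Python as below. The function names and the sample rows are hypothetical, and the timestamps are assumed to be in UTC, which the data-set description does not state.

```python
import calendar
from collections import defaultdict
from datetime import datetime


def to_posix(timestamp_text):
    """Convert 'YYYY-MM-DD HH:MM:SS' (assumed to be UTC) to a POSIX timestamp."""
    parsed = datetime.strptime(timestamp_text, "%Y-%m-%d %H:%M:%S")
    return calendar.timegm(parsed.timetuple())


def summarize_applications(applications):
    """Aggregate (job_id, app_date) pairs into the per-post derived attributes."""
    stamps = defaultdict(list)
    for job_id, app_date in applications:
        stamps[job_id].append(to_posix(app_date))
    return {
        job_id: {
            "applications_count": len(values),
            "first_applicant_timestamp": min(values),
            "last_applicant_timestamp": max(values),
        }
        for job_id, values in stamps.items()
    }


# Made-up rows in the shape of the applications file: (job_id, app_date).
rows = [
    ("job-a", "2014-01-02 16:01:26"),
    ("job-a", "2014-01-01 07:27:52"),
    ("job-b", "2014-01-03 00:00:00"),
]
print(summarize_applications(rows))
```

The same `to_posix` helper also covers the post_timestamp attribute generated from post_date during cleaning.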

Case Study: Jobs Recommendation

Extracting tags from the jobs opens the door to recommending jobs to applicants. We use the entity_list and concept_list attributes to rank the job posts that are relevant to a given applicant. On the other side, we build a keywords vector from the applicant profile. The recommendation step selects the 10 highest matching scores, computed using a heuristic model.
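
The heuristic model itself is not described in this post. As a rough sketch of the general shape of such a ranking, the Python snippet below scores each job by the fraction of its tags found in the applicant's keywords and keeps the best matches. The scoring rule, the function names and the sample data are assumptions, not the actual system.

```python
import heapq


def match_score(applicant_keywords, job_tags):
    """Assumed heuristic: fraction of the job's tags found in the applicant's keywords."""
    if not job_tags:
        return 0.0
    return len(job_tags & applicant_keywords) / float(len(job_tags))


def recommend(applicant_keywords, jobs, top_n=10):
    """Return the top_n (score, job_id) pairs, best first."""
    scored = ((match_score(applicant_keywords, tags), job_id) for job_id, tags in jobs.items())
    return heapq.nlargest(top_n, scored)


# Made-up jobs: job_id -> tags (e.g. the union of entity_list and concept_list).
jobs = {
    "job-a": {"linux", "network", "windows"},
    "job-b": {"java", "spring"},
    "job-c": {"linux", "sales"},
}
applicant = {"linux", "windows", "network", "sql"}
print(recommend(applicant, jobs, top_n=2))
```

Using `heapq.nlargest` avoids sorting every job post when only the first few recommendations are needed.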

As a proof of concept, we analyzed some applicants' profiles (their private information, such as names, was removed for anonymity). The following is a sample of the analyzed profiles:

Our system recommends the following job posts to this applicant (ordered from best to lowest):

  • System Administrator, with job_id = "8c872132", with score = 1.0

  • System Administrator, with job_id = "a13539c", with score = 1.0

  • System Administrator, with job_id = "c820bb65", with score = 1.0

  • Technical Support Engineer French Speaker, with job_id = "6783a66f", with score = 1.0

  • Data Entry & IT Technician, with job_id = "8c872132", with score = 1.0

  • Software Developer SharePoint, with job_id = "990d3300", with score = 1.0

  • .Net Developer, with job_id = "22a298c7", with score = 1.0

  • Operations Support Engineer, with job_id = "69318c48", with score = 1.0

  • Microsoft Product Manager, with job_id = "eb59b18d", with score = 0.8571

For further details about these job posts, check the dataset using these IDs.

Case Study: Job Summary

In some use cases, it is useful to summarize a bulk of text and get the most relevant information from a given text. As mentioned before, we added the relation_list attribute which highlights most of the important notes from the description part. Using this attribute, we can provide a short, yet descriptive, summary of the post. As an example, here is the original job post description and its summary for job post number 68417a3c.

Original job description (1317 characters)
Temporary Vacancy (4 Months)
Students/Undergraduates are acceptable.
Working as a promoter at Key Accounts' stores that sell OneCard Items like " Mobile and Electronic Chains in Egypt" required  :
Daily contacting with the sales staff  working at the store/s for:
Training them and handling their complaints.
Delivering all POS materials as much as possible “posters, flyers and danglers,,, etc” .
Following up the stock movement and sales volumes.
Updating our files with the dealers’ data base.
Getting feedback and requested info about the market and competitors.
Achieving the monthly targeted plan of performing successful No of presentation s for the end users at the store/s that is set by the Distribution Team leader / Supervisor/ Manager.
Sending reports of these presentations to the Distribution Team leader/Supervisor/ Manager on Daily basis.
Bachelor Degree
Good command and knowledge of Microsoft office (Word-Excel- Outlook)
Good writing and speaking English
highly Presentab
Having training and educating skills
Having selling Skills
Communication & Personal Effectiveness/ Interpersonal Skills
Building Relationships
Delivering Excellent Service / Service Orientation
Problem Solving
Marketing & Sales
Team Working
0 up to 2years experience in sales, distribution& marketing activities

Job Summary (544 characters)
Students/Undergraduates are acceptable.
Working as a promoter at Key Accounts' stores that sell OneCard Items like " Mobile and Electronic Chains in Egypt" required :
Updating our files with the dealers’ data base.
Getting feedback and requested info about the market and competitors.
Achieving the monthly targeted plan of performing successful No of presentation s for the end users at the store/s that is set by the Distribution Team leader / Supervisor/ Manager.
0 up to 2 years experience in sales, distribution& marketing activities

The original description contains 1317 characters while the summarized one contains only 544 characters, which is a reduction of approximately 59% in size.


Automated Job Recommendations

January 17th 2016, 4:09 am | Category: Big Data | 0 comments


   One of the most important foundations for a company to grow properly is choosing the right employees who fit its needs: not only in their technical skills, but also in how their culture fits the company. On the other side, choosing the most appropriate job is very important for job-seekers to advance their career and quality of life.


   The recruitment process has become increasingly difficult: it means choosing the right employee for each job among plenty of candidates, each with different skills, cultures and ambitions.

   Recommender system technology aims to help users find items that match their personal interests. So we can use this technology to solve the recruitment problem for both sides: companies, to find appropriate candidates, and job-seekers, to find favorable positions. So let's talk about what science can offer to solve this bidirectional problem.


   In the world of data science, the more information we can get, the more accurate results we may have. So let’s start with the available information we can collect about job-seekers and jobs.

Job Seeker

  • Personal information, such as language, social situation and location.
  • Information about current and past professional positions held by the candidate. This section may contain company names, positions, company descriptions, job start dates, and job finish dates. The company description field may further contain information about the company (for example the number of employees and industry).
  • Information about the educational background, such as university, degrees, fields of education, start and finish dates.
  • IT skills, awards and publications.
  • Relocation ability.
  • Activities (like, share, short list)

Job


  • Required skills.
  • Nice to have skills.
  • Preferred location (onsite, work from home).
  • Company preferences.

Information extraction

   To get all this information, we may face another big challenge: most of it may be embedded in plain text (e.g. resume, job post description, etc.). So we need to apply knowledge extraction techniques to those texts to get a complete view of the requirements and skills.


Information enrichment

   A good matching technique requires more than looking at explicit information only. For example, consider a job post looking for a candidate with knowledge of the Java programming language, and a candidate who has claimed knowledge of the Spring framework. If we only look for candidates with an explicitly declared Java skill, this candidate will not be shown, although he has an implicit Java skill through using the Spring framework. To solve this problem, we need to enrich both the job and the candidate information using a knowledge base that can link these two skills, or at least knows that using the Spring framework implies a Java skill. This improves accuracy by looking at meanings and concepts instead of explicit information only.
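
A minimal Python sketch of this enrichment step is shown below. The knowledge base here is a tiny hand-written dictionary used only for illustration; a real system would rely on a proper ontology, and the names `IMPLIES` and `enrich` are hypothetical.

```python
# Hypothetical knowledge base: each skill maps to the skills it directly implies.
IMPLIES = {
    "spring": {"java"},
    "cakephp": {"php"},
    "java": {"programming"},
    "php": {"programming"},
}


def enrich(skills, knowledge_base=IMPLIES):
    """Return the skills plus everything they imply, directly or transitively."""
    enriched = set(skills)
    pending = list(skills)
    while pending:
        skill = pending.pop()
        for implied in knowledge_base.get(skill, ()):
            if implied not in enriched:
                enriched.add(implied)
                pending.append(implied)
    return enriched


candidate = enrich({"spring"})
required = {"java"}
print(required <= candidate)  # True: the Spring candidate now matches a Java requirement
```

Without the enrichment, the plain set comparison `{"java"} <= {"spring"}` fails, which is exactly the missed match described above.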



Let’s define some guidelines we need to take care of when working on the matching:

  • Matching individuals to jobs depends on the skills and abilities that the individuals should have.
  • Recommending people is a bidirectional process; it should take into account the preferences of both the recruiter and the candidate.
  • Recommendations should be based on the candidate’s attributes, as well as the relational aspects that determine the fit between the person and the team members/company with whom the person will be collaborating (fit the candidate to the company, not only to the job).
  • We must distinguish between must-have and nice-to-have requirements, and adjust their contribution with dynamic weights.
  • Use an ontology to categorize jobs as a knowledge base.
  • Enrich job-seeker and job profiles with the knowledge base (knowing the CakePHP framework also implies knowing PHP).
  • Normalize the data so that no single feature dominates the matching.
  • Learn from others' job transitions.

Recommendation Techniques

Let’s list some techniques used in the recommendation field. No technique is suitable for all cases; you first need to relate it to the type of data you have and to your overall case.

  • Collaborative filtering
    • In this technique, we look for similar behavior among job-seekers, so we can find job-seekers who have similar interests and make job recommendations from their jobs of interest.
  • Content-based filtering
    • In this technique, we look at the profile content of both the job-seeker and the job post and find the best matches between them, regardless of the behavior of the job-seeker and the company that posted the job.
  • Hybrid
    • Weighted: the score of a recommended item is calculated from the results of all the recommendation techniques available in the system.
    • Switching: the system uses some criteria to switch between recommendation techniques.
    • Mixed: several recommenders are applied simultaneously, so we can mix the results from them.
    • Feature Combination: uses the collaborative information as additional feature data for each item and applies content-based techniques over this augmented data set.
    • Cascade: a staged process in which one recommendation technique is used first to produce a rough ranking of candidates, and a second technique refines the recommendation.
  • 3A Ranking algorithm: maps jobs, companies and job-seekers to a graph with relations between them (apply, favorite, post, like, similar, match, visit, … etc.), then relies on the relations and ranking to recommend items.
    • Content-based similarity is used to calculate the similarity between jobs, job-seekers and companies, and between each of them and the others (e.g. matching a profile between a job and a job-seeker).
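
To make the weighted hybrid idea concrete, here is a small Python sketch that combines a content-based score (Jaccard similarity of keyword sets) with a very simple collaborative score (how many job-seekers with overlapping applications applied to a job). The weights, function names and data are assumptions chosen for illustration, not a reference implementation of any of the techniques above.

```python
def jaccard(a, b):
    """Content-based similarity between two keyword sets."""
    union = a | b
    return len(a & b) / float(len(union)) if union else 0.0


def collaborative_scores(target, interactions):
    """Score jobs by how many users sharing a job with the target applied to them."""
    mine = interactions[target]
    scores = {}
    for user, jobs in interactions.items():
        if user == target or not (jobs & mine):
            continue
        for job in jobs - mine:
            scores[job] = scores.get(job, 0) + 1
    top = max(scores.values()) if scores else 1
    return {job: count / float(top) for job, count in scores.items()}


def weighted_hybrid(target, profiles, job_tags, interactions, w_content=0.6, w_collab=0.4):
    """Combine both scores with fixed weights (assumed values)."""
    collab = collaborative_scores(target, interactions)
    ranked = []
    for job, tags in job_tags.items():
        if job in interactions[target]:
            continue  # skip jobs the job-seeker already applied to
        score = w_content * jaccard(profiles[target], tags) + w_collab * collab.get(job, 0.0)
        ranked.append((score, job))
    return sorted(ranked, reverse=True)


# Made-up data: one profile, three jobs, and who applied to what.
profiles = {"u1": {"java", "sql"}}
job_tags = {"j1": {"java", "spring"}, "j2": {"java", "sql"}, "j3": {"sales"}}
interactions = {"u1": {"j1"}, "u2": {"j1", "j3"}, "u3": {"j2"}}
print(weighted_hybrid("u1", profiles, job_tags, interactions))
```

In this toy example j2 ranks first on content alone, while j3 only appears because a similar job-seeker applied to it, which shows how the two signals complement each other.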

General recommendation system Architecture

Figure 1 - General System architecture.


   To create a self-improving system, you need feedback on the results you produce so you can correct yourself over time. The best feedback you can get is feedback from the real world, so we can depend on job-seekers' and companies' feedback to adjust the results as desired.

  • Explicit: Ask users to rate the recommendations (jobs / candidates)
  • Implicit: Track interaction on recommendations (applied, accepted, short list and ignored)

Further Reading

  • Proceedings of the 22nd International Conference on World Wide Web. Yao Lu, Sandy El Helou, Denis Gillet (2013). A Recommender System for Job Seeking and Recruiting Website.
  • JOURNAL OF COMPUTERS, VOL. 8. Wenxing Hong, Siting Zheng, Huan Wang (2013). A Job Recommender System Based on User Clustering.
  • International Journal of the Physical Sciences Vol 7(29). Shaha T. Al-Otaibi, Mourad Ykhlef (July 2012). A survey of job recommender systems.
  • Proceedings of the fifth ACM conference on Recommender systems. Berkant Cambazoglu, Aristides Gionis (2011). Machine learning job recommendation.

It's our pleasure to highlight the initiative taken by our data team leader Ahmed Mahran to effectively contribute to the Spark Time Series project, created by Sandy Ryza, a senior data scientist at Cloudera, the leading big data solutions provider.


Time series data has gained increasing attention in the past few years. To quote Sandy Ryza:


Time-series analysis is becoming mainstream across multiple data-rich industries. The new Spark-TS library helps analysts and data scientists focus on business questions, not on building their own algorithms.


Find the full story here, where he introduces SparkTS and credits our contributor.


We are forever indebted to the open source community; it has enabled us to create wonderful feats. It's our deep belief that we should give back to the community in order to guarantee its health and sustainability. We are proud that we effectively contributed to such a great project, and we are looking forward to more.