Assume we have the following code (we are using Pundit for authorization and ActiveModel::Serializer to format the JSON response):
 

class User < ApplicationRecord
  has_many :projects
end

class Project < ApplicationRecord
end
 
class ProjectPolicy < ApplicationPolicy
  attr_accessor :user, :project
 
  def initialize(user, project)
    @user    = user
    @project = project
  end
 
  def show?
    user.member_of?(project)
  end
 
  def destroy?
    user.admin_of?(project)
  end
end
 
class ProjectsController < ApplicationController
 
  def show
    authorize @project
    render json: @project
  end
 
  def destroy
    authorize @project
    @project.destroy
    head :no_content
  end
end
 
class ProjectSerializer < ActiveModel::Serializer
  attributes :name, :description
end
When we request a project resource, we will get:
 
# http://myapp/projects/1
{
  "project": {
    "id": 1,
    "name": "My project",
    "description": "My project description"
  }
}
When a user tries to do any action on a project, we first authorize the user to make sure they can perform this action on this project. Everything seems fine.

But what about client-side validation? If the mobile app or the JavaScript code in the browser consuming this API wants to do client-side validation, for a better user experience and to save the backend server from useless requests, then we have to return the policy with the JSON response, like this:
 
class ProjectSerializer < ActiveModel::Serializer
  attributes :name, :description, :policy
 
  def policy
    project_policy = ProjectPolicy.new(scope.current_user, self.object)
    {
      show: project_policy.show?,
      destroy: project_policy.destroy?,
    }
  end
end
And in case you don't already do this, add the following line to ApplicationController, so that we can use the methods available in Rails HTML views (like current_user, which will be available within scope):
serialization_scope :view_context # in application_controller.rb

# http://myapp/projects/1

{
  "project": {
    "id": 1,
    "name": "My project",
    "description": "My project description",
    "policy": {
      "show": true,
      "destroy": false
    }
  }
}
But then we have to remember that whenever we add a new method to our policy class and use it in our controller, we also have to add that method to the JSON response, so that the API consumer is aware of this policy.

The problem is that we may easily forget to do this, and new developers contributing to the code will most likely forget, as this is not an intuitive step: new developers must be told about it and they must remember it. That is a recipe for bugs.

A simple solution I used is Ruby metaprogramming to add the whole policy hash to the API response automatically. Here is the black magic code (placed in an initializer):
 
ActiveModel::Serializer.class_eval do
  def self.add_policies
    attribute :policy

    def policy
      # scope is the view context (see serialization_scope above),
      # so Pundit's `policy` helper is available here
      policy_instance = scope.policy(self.object)
      policy_instance.as_json
    end
  end
end
And override as_json in ApplicationPolicy so that it returns only the public methods ending with a question mark (which indicates that they are policy methods):
 
class ApplicationPolicy
  def as_json
    policies = {}
    public_methods(false).select { |method| method.to_s =~ /\?$/ }.each do |policy_method|
      policies[policy_method.to_s.sub(/\?$/, '')] = send(policy_method)
    end
    policies
  end
end
Then in ProjectSerializer, we just call add_policies
 
class ProjectSerializer < ActiveModel::Serializer
  add_policies
  attributes :name, :description
end
And we have all our policy methods automatically added to the project JSON object.

Today I faced the task of saving documents with Rails Paperclip to AWS S3, but I needed to do this in the background in order not to block the user. I didn't want to use delayed_paperclip, as it only works with delayed_job or Resque; I wanted something that would run on any ActiveJob adapter.

My idea was inspired by an existing solution; however, that solution has points that can be enhanced: you need to create another local attachment for each attachment you add to your model, you may forget to add the migration that creates it, you may forget to add it to the model, and finally it adds more columns to your model than needed.

The idea I used is as follows:
- Before saving the attachment, switch the storage to Filesystem, so that the document is written locally (not in the public directory) rather than to S3
- After saving the attachment, enqueue a background job that uploads the document to AWS S3
The code is very basic and can be enhanced, but this is the core idea, and it has been working fine so far.

 

# In your model:
has_attached_file :attachment, process_in_background: true
# Then use Paperclip normally, you don't have to do/change anything else!
my_model.attachment = some_file


# Put this code in the initializers directory
module Extensions
  module Paperclip
    module Attachment
      attr_accessor :uploading_to_s3

      def uploading_to_s3?
        uploading_to_s3
      end
    end

    module S3BackgroundHandler
      extend ActiveSupport::Concern

      included do
        class << self


          def has_attached_file(name, options = {})
            super(name, options)

            all_options = ::Paperclip::Attachment.default_options.merge(options)
            process_in_background(name) if all_options[:process_in_background]
          end

          def process_in_background(name)
            original_options = {}
            save_to_disk = Proc.new{|name, attachment| send("#{name}_updated_at_changed?") && !attachment.uploading_to_s3?}

            # Before saving: switch the attachment to local filesystem storage,
            # so the file is written to disk instead of being uploaded to S3.
            before_save do
              attachment = send(name)
              if instance_exec(name, attachment, &save_to_disk)
                original_options = attachment.options.dup
                attachment.options.merge!({
                  path: LOCAL_PAPERCLIP_ATTACHMENT_PATH
                })
                attachment.extend ::Paperclip::Storage::Filesystem
              end
            end

            # After saving: enqueue a job that uploads the local file to S3 and
            # performs any pending deletions, then restore the original options.
            after_save do
              attachment = send(name)
              if instance_exec(name, attachment, &save_to_disk)
                queued_for_delete = attachment.instance_eval("@queued_for_delete")
                UploadDocumentToS3Job.perform_later(self.class.name, self.id, name.to_s, queued_for_delete)
                queued_for_delete.clear
                attachment.options.merge!(original_options)
              end
            end
          end

        end
      end

      LOCAL_PAPERCLIP_ATTACHMENT_PATH = ":rails_root/files/:class/:attachment/:id_partition/:style/:filename"

    end

  end
end

Paperclip::Attachment.include Extensions::Paperclip::Attachment
ApplicationRecord.include Extensions::Paperclip::S3BackgroundHandler


# Put this code in the jobs directory
class UploadDocumentToS3Job < ApplicationJob
  queue_as :upload_document_to_s3

  def perform(class_name, id, attachment_name, queued_for_delete)
    record = class_name.constantize.find(id)
    attachment = record.send(attachment_name)

    # deleting
    s3_bucket = attachment.s3_bucket
    queued_for_delete.each do |path|
      begin
        attachment.send(:log, "deleting #{path}")
        s3_bucket.object(path.sub(%r{\A/}, "")).delete
      rescue Aws::Errors::ServiceError => e
        # Ignore this.
      end
    end

    # uploading
    original_options = attachment.options.dup
    attachment.options.merge!({
      path: ApplicationRecord::LOCAL_PAPERCLIP_ATTACHMENT_PATH
    })
    file_path = attachment.path
    attachment.options.merge!(original_options)

    File.open(file_path) do |file|
      attachment.uploading_to_s3 = true
      attachment.assign(file)
      attachment.reprocess!
      attachment.uploading_to_s3 = false
    end
    File.delete(file_path) rescue nil
  end

end

Time Series Analysis using Spark

June 29th 2017, 10:30 am · Category: Big Data

A time series is a sequence of data points ordered by time. Time series analysis is a methodology for extracting useful and meaningful information from these data points. Science-oriented languages such as R support different time series forecasting models such as ARIMA, HoltWinters and ETS. In an earlier post, we demonstrated using these models to forecast sales from Salesforce (SFDC) data. A major drawback of R is that it is single-threaded, which limits its scalability. In this post, we will explore different alternatives: Python, Spark on Scala, and PySpark. For brevity, we will focus only on the ARIMA model.

Sales Forecasting using Python

Unlike R, Python does not support automatic detection of the best ARIMA model. Further, its ARIMA implementation predicts a single data point at a time. The following code handles these limitations:
import sys
import warnings

import numpy as np
from statsmodels.tsa.arima_model import ARIMA

# MAX_P, MAX_D and MAX_Q bound the (p, d, q) search space and are defined as constants elsewhere
def predict_ARIMA_AUTO(amounts, period):
    # auto-fitting
    warnings.filterwarnings("ignore")  # ignore warning messages
    best_p, best_d, best_q = 0, 1, 0
    best_aic = sys.maxint
    best_history = []
    for p in range(MAX_P):
        for d in range(MAX_D):
            for q in range(MAX_Q):
                try:
                    model = ARIMA(np.asarray(amounts, dtype=np.float64), order=(p, d, q))
                    model_fit = model.fit(disp=0)
                    model_aic = model_fit.aic
                    if model_aic < best_aic:
                        # prediction
                        size = len(amounts)
                        history = amounts[0:size]
                        for t in range(0, period):
                            model = ARIMA(np.asarray(history, dtype=np.float64), order=(p, d, q))
                            model_fit = model.fit(disp=0)
                            aic = model_fit.aic
                            bic = model_fit.bic
                            output = model_fit.forecast()
                            history.append(output[0][0])
                        best_history = history
                        best_p, best_d, best_q = p, d, q
                        best_aic = model_aic
                except:
                    continue
    print "ARIMA(", best_p, best_d, best_q, ")", "AIC=", best_aic
    return best_history[len(amounts):]
The code tries out different combinations of the ARIMA parameters (p, d and q) in the three nested loops, and picks the best model. The best ARIMA model for a given data set is the one with the lowest AIC value. In order to work around the single-point prediction, we append each predicted point to the data set and predict again. This incremental prediction allows us to predict N points instead of the single point returned by default.

The following code is similar to the R code illustrated before, and it forecasts the SFDC data:
from datetime import datetime
from dateutil.relativedelta import relativedelta
from util import *

predictor = predict_ARIMA_AUTO  # the prediction function used in this section (defined above)
period = 6                      # months
# retrieve data
sparkSession = getSparkSession()
data = loadData(sparkSession)
amounts = data.rdd.map(lambda row: row.Amount).collect()
series = data.rdd.map(lambda row: row.CloseDate).collect()
# train and check accuracy
trainingSize = int(0.75 * len(amounts))
checkingSize = len(amounts) - trainingSize
trainingData = amounts[0:trainingSize]
checkingData = amounts[trainingSize:]
checkingPredicted = predictor(trainingData, checkingSize)
squareError = 0
for i in range(0, checkingSize):
    squareError += (checkingPredicted[i] - checkingData[i])**2
    print int(checkingPredicted[i]), "should be", checkingData[i]
mse = squareError**(1/2.0)
print "Root Square Error", int(mse)
# prediction
predicted = predictor(amounts, period)
month = datetime.strptime(series[0], '%Y-%m')
d = []
for i in amounts:
    d.append((month, i))
    month += relativedelta(months=1)
for i in predicted:
    d.append((month, long(i.item())))
    month += relativedelta(months=1)
df = sparkSession.createDataFrame(d, ['date', 'amount'])
saveData(df)
The code is self-explanatory and follows the same logic described before. Here we encapsulate loading the data from the database and saving the results in a custom util package. The training detects an ARIMA model with the following parameters, and a root mean square error of 1500959:
ARIMA( 0 1 0 ) AIC= 687.049546173
The final model, using the full data set, has the following parameters:
ARIMA( 2 1 0 ) AIC= 967.844839706
and it gives the following predictions
Sep 2016  1361360
Oct 2016  1095653
Nov 2016  1268538
Dec 2016  1416972
Jan 2017  1301205
Feb 2017  1334660

Sales Forecasting using Spark

Both R and Python are single-threaded, which is not suitable for processing (or loading) large data. Previously, we counted on the database to retrieve, aggregate and sort the data. Here, we will make use of the power of Spark to handle this. Spark SQL provides an SQL interface for converting SQL queries into Spark jobs. Additionally, a third-party ARIMA implementation is available at https://github.com/sryza/spark-timeseries/. In this and the next section we will use Spark to model and forecast the data.
import com.cloudera.sparkts.models.ARIMA
import org.apache.spark.mllib.linalg.DenseVector
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{Row, SparkSession}
 
object CloseWin {
 
  def forecast(): Unit ={
    val APP_NAME = "Sales Forecast"
    val period = 6
 
    val conf = new SparkConf().setAppName(APP_NAME).setMaster("local[2]")
    val sc = new SparkContext(conf)
    val spark = SparkSession
      .builder()
      .appName(APP_NAME)
      .getOrCreate()
 
    var dfr = spark.read
      .format("jdbc")
      .option("url", "jdbc:mysql://my-db-domain:3306/sfdc")
      .option("user", "my-db-username")
      .option("password", "my-db-password")
 
    val df2 = dfr.option("dbtable","(SELECT CloseDate, Amount FROM toast.opportunity WHERE IsWon='true' AND IsClosed='true') as win").load()
    df2.createOrReplaceTempView("opp")
    val df = df2.sqlContext.sql("SELECT DATE_FORMAT(CloseDate,'yyyy-MM') as CloseDate, SUM(Amount) as Amount FROM opp GROUP BY DATE_FORMAT(CloseDate,'yyyy-MM') ORDER BY CloseDate")
    val monthes = df.collect().flatMap((row: Row) => Array(row.get(0)))
    val amounts = df.collect().flatMap((row: Row) => Array(row.getLong(1).intValue().toDouble))
    {
      // Training
      val trainingSize = (amounts.length * 0.75).toInt
      val trainingAmounts = new Array[Double](trainingSize)
      for(i <- 0 until trainingSize){
        trainingAmounts(i) = amounts(i)
      }
 
      val actual = new DenseVector(trainingAmounts)
      val period = amounts.length - trainingSize
      val model = ARIMA.autoFit(actual)
      println("best-fit model ARIMA(" + model.p + "," + model.d + "," + model.q + ") AIC=" + model.approxAIC(actual) )
      val predicted = model.forecast(actual, period)
      var totalErrorSquare = 0.0
      for (i <- (predicted.size - period) until predicted.size) {
        val errorSquare = Math.pow(predicted(i) - amounts(i), 2)
        println(monthes(i) + ":\t\t" + predicted(i) + "\t should be \t" + amounts(i) + "\t Error Square = " + errorSquare)
        totalErrorSquare += errorSquare
      }
      println("Root Mean Square Error: " + Math.sqrt(totalErrorSquare/period))
    }
 
    {
      // Prediction
      val actual = new DenseVector(amounts)
      val model = ARIMA.autoFit(actual)
      println("best-fit model ARIMA(" + model.p + "," + model.d + "," + model.q + ")  AIC=" + model.approxAIC(actual)  )
      val predicted = model.forecast(actual, period)
      for (i <- 0 until predicted.size) {
        println("Model Point #" + i + "\t:\t" + predicted(i))
      }
    }
  }
}

Sales Forecasting using PySpark

PySpark is a Python interface for Spark. We can reuse the Python implementation explained before, but we will change the following:

The util package, to make use of Spark:
import pyspark

def getSparkSession():
    return pyspark.sql.SparkSession.builder.getOrCreate()
 
def loadData(sparkSession):
    sparkSQL = sparkSession.read.format("jdbc").option("url", "jdbc:mysql://my-db-domain:3306/sfdc").option("dbtable", "sfdc.opportunity").option("user", "my-db-username").option("password", "my-db-password").load()
    sparkSQL.createOrReplaceTempView("opp")
    return sparkSQL.sql_ctx.sql("SELECT DATE_FORMAT(CloseDate,'yyyy-MM') as CloseDate, SUM(Amount) as Amount FROM opp WHERE IsWon='true' AND IsClosed='true' GROUP BY DATE_FORMAT(CloseDate,'yyyy-MM') ORDER BY CloseDate")
 
def saveData(result):
    result.show()
 
The prediction method, to use Spark's ARIMA instead of the default Python implementation:

from pyspark.mllib.common import _py2java, _java2py
from pyspark.mllib.linalg import Vectors

def predict_ARIMA_Spark(amounts, period):
    spark_context = pyspark.SparkContext.getOrCreate()
    model = spark_context._jvm.com.cloudera.sparkts.models.ARIMA.autoFit(_py2java(spark_context, Vectors.dense(amounts)), MAX_P, MAX_D, MAX_Q)
    p = _java2py(spark_context, model.p())
    d = _java2py(spark_context, model.d())
    q = _java2py(spark_context, model.q())
    jts = _py2java(spark_context, Vectors.dense(amounts))
    aic = model.approxAIC(jts)
    print "ARIMA(", p, d, q, ")", "AIC=", aic
    jfore = model.forecast(jts, period)
    return _java2py(spark_context, jfore)[len(amounts):]
Unfortunately, the third-party implementation of Spark time series for Python is not native. It just delegates the calls to Java and uses Py4J to wire Python to Java.

Sales Forecasting using R

June 29th 2017, 9:53 am · Category: Big Data

Sales forecasting is the process of estimating future sales and revenue in order to enable companies to make informed business decisions and predict short-term and long-term performance. Companies can base their forecasts on past sales data, industry-wide comparisons, and economic trends. The problem of sales forecasting can be classified as time-series forecasting, because time is the domain over which the data (sales or revenue) changes.

Time Series Analysis

A time series is a sequence of data points ordered by time. Time series analysis is a methodology for extracting useful and meaningful information from these data points. Any time series can be decomposed into three components:
  • Trend: the regression of the data points with time. For example, a time series with a positive trend means that the values of the data points at time (t+n) are larger than the ones at time (t). Here the value of the data depends on time rather than on the previous values.
  • Seasonality (Cycle): the repetition of the data over the time domain. In other words, the data values at time (t+n) are the same as the data at time (t), where n is the seasonality or cycle length.
  • Noise (Random Walk): a time-independent (non-systematic) component that is added to (or subtracted from) the data points.

Based on this, we can classify time series into two classes:
  • Non-Stationary: data points whose mean, variance and covariance change over time. This shows up as trends, cycles, or a combination of them.
  • Stationary: data points whose mean, variance and covariance do not change over time.

The theory behind non-stationary signals and forecasting is not mature, and modeling them is complex, which leads to inaccurate results. Luckily, a non-stationary series can be transformed into a stationary one using common techniques (e.g. differencing). The idea of differencing is to subtract each data value from its predecessor, so the new series loses a component of the time dependency.
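As a minimal illustration in R with made-up values, first-order differencing removes a linear trend and leaves a series that fluctuates around a roughly constant mean:

x = c(100, 112, 119, 131, 140, 152, 159, 171)  # hypothetical upward-trending series
dx = diff(x)                                   # subtract each value from its predecessor
print(dx)                                      # 12 7 12 9 12 7 12 -- roughly constant mean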

Sales-Force Forecasting

Salesforce.com (abbreviated as SF or SFDC) is a cloud computing company that provides customer relationship management (CRM) products. Salesforce.com's CRM service is broken down into several broad categories: Sales Cloud, Service Cloud, Data Cloud, Marketing Cloud, Community Cloud, Analytics Cloud, App Cloud, and IoT, with over 100,000 customers.
In this post we will analyze and forecast a sample of sales data from the Salesforce CRM that shows the sales growth between 2013 and 2016, and we will predict the sales values for the next two business quarters. We will use two models for forecasting, ARIMA and HoltWinters, and will demonstrate how to do that using the R language.
The methodology that we will follow is:
  1. Aggregate the data per month
  2. Construct the model using 75% of the data as training set
  3. Check the model accuracy using 25% of the data, and calculate the root mean square error
  4. Forecast the data for the next two quarters
We assume the data is stored in an "opportunity" table, and we are interested in the following fields:
  • CloseDate: the date of closing the opportunity (measure field)
  • Amount: the monetary amount obtained
  • IsWon and IsClosed: flags indicating whether the opportunity is won/lost and closed/open
The following plots show the Amount vs the CloseDate, and the decomposition of the time series into trend, seasonality and residuals.
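Such plots can be produced in a couple of lines of R. Here is a minimal sketch, assuming the close.win.opp data frame loaded by the script below and monthly data starting in January 2013:

amount.ts = ts(close.win.opp$Amount, start = c(2013, 1), frequency = 12)
plot(amount.ts, xlab = "CloseDate", ylab = "Amount")  # Amount vs CloseDate
plot(decompose(amount.ts))                            # trend, seasonal and random components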

Sales Forecasting using R

We need to install two packages: RJDBC for connecting to the DB and retrieving the data, and forecast for data modeling, analysis and forecasting. The following R script shows sales forecasting using ARIMA.
ARIMA is an abbreviation for AutoRegressive Integrated Moving Average, so it is a combination of multiple techniques:
  • Auto-regression (AR)
  • Integration (I)
  • Moving Average (MA)
library(RJDBC)
library(forecast)

rmse <- function(sim, obs){
  return(sqrt(mean((sim - obs)^2, na.rm = TRUE)))
}

construct_model <- function(data){
  data.start = strsplit(data$CloseDate[1], "-")
  data.end = strsplit(data$CloseDate[nrow(data)], "-")
  data.ts = ts(data$Amount, 
                start=c(as.integer(data.start[[1]][1]), 
                        as.integer(data.start[[1]][2])), 
                end=c(as.integer(data.end[[1]][1]), 
                      as.integer(data.end[[1]][2])), 
                frequency = 12)
  
  model = auto.arima(data.ts)
  summary(model) 
  return(model)
}

get_forecast_model <- function(close.win.opp){
  # Train with 75% of data   
  N = ceiling(0.75*nrow(close.win.opp))
  train.data = close.win.opp[1:N,]
  model = construct_model(train.data)

  # Test with 25% of data   
  test.data = close.win.opp[(N+1):nrow(close.win.opp),]
  predicted = forecast(model, length(test.data$Amount))
  cat("RMSE=", rmse(predicted$mean, test.data$Amount), "\n")
  
  # Train with all data 
  model = construct_model(close.win.opp)
  return(model)
}

drv <- JDBC("com.mysql.jdbc.Driver",  classPath="./mysql-connector-java-5.1.41-bin.jar")
conn <- dbConnect(drv, "jdbc:mysql://my-db-domain:3306/sfdc", "my-db-username", "my-db-password")

close.win.opp = dbGetQuery(conn, "SELECT DATE_FORMAT(CloseDate,'%Y-%m') as CloseDate, SUM(Amount) as Amount FROM opportunity WHERE IsWon='true' AND IsClosed='true' And CloseDate < '2016-09-01' GROUP BY DATE_FORMAT(CloseDate,'%Y-%m') ORDER BY CloseDate")

model = get_forecast_model(close.win.opp)
predicted = forecast(model, 6)
plot(predicted)

The script splits the data set into training data (75%) and verification data (25%). Next, it builds a model based on the training data. R has an ARIMA implementation featuring automatic detection of the parameters (auto.arima). For the SFDC dataset mentioned before, we obtained the following model, with a root mean square error of 233560:
ARIMA(0,1,0)(0,1,0)[12]                   
sigma^2 estimated as 4.254e+10:  log likelihood=-218.49
AIC=438.98   AICc=439.27   BIC=439.76
Next, we build a model using the whole data set; the obtained model is:
ARIMA(0,1,1)(0,1,0)[12]                   
sigma^2 estimated as 5.288e+10:  log likelihood=-343.82
AIC=691.64   AICc=692.18   BIC=694.07
Finally, we forecast the next 6 months using this model in the final forecast(model, 6) call. The forecast is:
         Point      Forecast   Lo 80   Hi 80     Lo 95   Hi 95
Sep 2016        1894667 1599973 2189362 1443970.6 2345364
Oct 2016        1520401 1201883 1838919 1033270.3 2007532
Nov 2016        1545809 1205130 1886488 1024785.6 2066833
Dec 2016        1477773 1116289 1839257  924930.8 2030616
Jan 2017        1517143 1135988 1898299  934216.2 2100070
Feb 2017        1825764 1425904 2225624 1214230.9 2437297

R supports different time series forecasting models. In the code above, you can easily change the forecasting model by replacing the auto.arima call in construct_model. For example, to use the HoltWinters model, change that line to:
model = HoltWinters(data.ts)
The root mean square error was 301992.9, and in this case the predictions were:
         Point      Forecast   Lo 80   Hi 80     Lo 95   Hi 95
Sep 2016        1760990 1507244 2014735 1372919.6 2149059
Oct 2016        1363874 1107298 1620450  971474.8 1756273
Nov 2016        1381785 1120706 1642865  982498.9 1781072
Dec 2016        1321675 1054109 1589240  912468.3 1730881
Jan 2017        1376777 1100499 1653055  954246.8 1799307
Feb 2017        1687538 1400158 1974919 1248028.3 2127049

As we see, using the HoltWinters model for our data is more appropriate than ARIMA (lower mean square error).

 

Wuzzuf Visualisations

June 28th 2017, 7:14 am · Category: None

WUZZUF is Egypt's #1 online recruitment jobs site, especially in terms of quality job offers and candidates. More than 3,000 companies and recruiters in Egypt have been actively hiring through it since it was launched in 2012.

Also, more than 160,000 job seekers, consisting of Egypt's top professionals and fresh graduates, visit WUZZUF and apply to jobs each month. Wuzzuf has recently published its data for exploration. The data includes job posts between 2014 and 2016, the applicants' ids and their application timestamps. In this work, we visualize the data to give insights into the Egyptian market, its needs, its evolution and its facts.

What are the Egyptian Business Needs?

Some of the most important questions for students and fresh graduates are "What are the most required skills in my domain?", "What should I learn?" and "What gives me a competitive edge over other candidates?". Rather than speculating about the answers, it is better to go to the recruiters and find out what their needs are. As recruiters' time is limited, they try to state most of their needs in the job description in order to filter the applicants. In this work, we exploit this valuable information and analyze the job descriptions of the posts between 2014 and 2016 in each business domain to capture the business needs.

The first step towards our goal is to extract the useful entities from the description. We used a third-party API from https://www.meaningcloud.com to extract tags which represent named entities such as people, organizations and places (e.g. MS Office, Word, Excel, Weeks and Cairo), and tags which represent significant keywords (e.g. ability, system, software, code and computer science). Secondly, we group the tags per job title (e.g. Web Designer, Call Center, Sales Manager), and we visualize these tags according to their frequency (number of occurrences) in all the posts. In the online tool, the user can select a job title and see a word cloud visualization of the most common entities in these posts. We will demonstrate here some of the outputs of our tool.
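Before looking at the outputs, here is a minimal sketch of the visualization step in R, using the wordcloud package with hypothetical tag frequencies (the real tags and counts come from the extraction and grouping described above):

library(wordcloud)

# Hypothetical tag frequencies for one job title
tags  = c("J2EE", "JavaScript", "JPA", "JMS", "Linux", "SQL", "MySQL", "Tomcat")
freqs = c(120, 110, 95, 60, 55, 50, 45, 30)

# Word size is proportional to how often the tag appears in the posts
wordcloud(words = tags, freq = freqs, min.freq = 1, random.order = FALSE)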

What are the top job requirements for a Senior Java Developer?

The following word cloud demonstrates the most relevant technologies, skills and languages that a senior Java developer should have, based on the job post analysis. We see that J2EE, JavaScript and JPA come first, then JMS, MVC Capital, IBM Rational Rose, Linux, SQL and MySQL come next. The tool also recommends other tools such as JBoss, WebSphere, Tomcat and WebLogic. We also see that Android and BIRT are less common for this job. We consider this word cloud a helpful tool (or checklist) for a Senior Java Developer to validate their knowledge.


What are the top job requirements for Web Designers?

The following word cloud shows the most frequent entities in the job posts seeking web designers. We see that JavaScript, jQuery and HTML5 are the top requirements, while Dreamweaver, Adobe Photoshop and Adobe Creative Suite come next. Knowledge of MVC Capital, Visual Arts and ASP.net is less required, yet needed.


What are the top job requirements for Call Center Agents?

As we rely on text analysis, it is not always guaranteed that we visualize meaningful tools or skills. When analyzing the Call Center Agent posts, we got the following word cloud. We can see that Cairo is the most frequent entity, which is expected as most call centers are located there. Speaking of places, we see (in order of frequency): Maadi, Heliopolis, Nasr City and Zayed. It seems as well that there is a trend towards hiring "Males" (it is more frequent than "Off Gender" in the cloud map), but we also see the words "Military Service", which seems to favor exempted candidates. Looking at the tools, we see that Excel, Word and MS Office are the required tools for this job.

Business Career: Opportunities & Salaries

An interesting career-related question is the trade-off between the experience level and the available opportunities, as well as the salary offered for each experience level. In this section, we capture this information and plot (per industry) the number of vacancies per career level, and show that against the average salaries for these vacancies. We use a bubble chart, where the bubble size is proportional to the number of vacancies, and the position of the bubble indicates the experience level and the average salary. An interactive tool that lets you explore any industry is available at this link; we encourage you to go ahead, try out our tool, check your industry and get back to us with your comments.
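Here is a rough sketch of how such a bubble chart can be drawn in R; the career levels, salaries and vacancy counts below are hypothetical, while the interactive tool uses the real Wuzzuf data:

career.levels = c("Student", "Entry", "Experienced", "Manager", "Senior Management")
salaries      = c(1500, 3000, 6000, 14000, 26000)  # average salary in EGP
vacancies     = c(5, 120, 200, 30, 8)              # number of job posts

# Bubble position encodes career level and average salary; bubble area encodes vacancies
symbols(seq_along(career.levels), salaries, circles = sqrt(vacancies), inches = 0.35,
        xaxt = "n", xlab = "Career level", ylab = "Average salary (EGP)")
axis(1, at = seq_along(career.levels), labels = career.levels)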

Pharmaceuticals Industry: Career Levels against Average Salaries

The demand for pharmaceutical vacancies is high at the "Entry" and "Experienced (Non-Manager)" levels, with average wages of 3000 EGP for entry level and 5000 EGP for "Experienced (Non-Manager)". "Manager" vacancy posts are few, with salaries varying between 12000 and 16000 EGP. Senior management vacancy posts are rare, with salaries over 26000 EGP. There is no demand for the "Student" level, which matches the nature of the pharmaceuticals industry.


Computer Software Industry: Career Levels against Average Salaries

Interestingly, most of the demand in the computer software industry is at the "Experienced (Non-Manager)" level, with an average salary of 6000 EGP. Next comes the demand at the "Entry Level", with an average salary of 3000 EGP. According to the analyzed posts, the "Senior Management" salaries are not as high as expected, but this may be due to other compensation packages (e.g. profit share).

Egyptian Job Demand Growth per Industry

Finally, in this part we analyze the growth of job vacancies over the past 2 years. This is important for investors and for online recruitment sites (like Wuzzuf), as it shows which industry sectors are worth approaching. Besides, it is intuitive that the number of applicants (site visitors) is proportional to the number of vacancies in their industry. As per our plot below, the most appealing sectors are Computer Software, IT Services, and Engineering Services; their growth doubled over the last 2 years. On the other hand, the telecommunication services did not see such growth, and we see saturation in that market's job demand. For other industries, the interactive tool can be used to plot their growth based on the 2014-2016 data.