Automated Job Recommendations

January 17th, 2016, 4:09 am | Category: Big Data

 

   One of the most important foundations for a company to grow properly is choosing employees that fit its needs, not only in technical skills but also in culture. On the other side, choosing the most appropriate job is very important for job-seekers to advance their careers and quality of life.

 

   The recruitment process has become increasingly difficult: the right employee must be chosen from plenty of candidates for each job, each with different skills, culture and ambitions.


   Recommender system technology aims to help users find items that match their personal interests. We can use this technology to solve the recruitment problem for both sides: companies, to find appropriate candidates, and job-seekers, to find favorable positions. So let's talk about what science can offer to solve this bidirectional problem.

Information

   In the world of data science, the more information we can get, the more accurate our results may be. So let's start with the information we can collect about job-seekers and jobs.

Job Seeker

  • Personal information, such as language, social situation and location.
  • Information about current and past professional positions held by the candidate. This section may contain company names, positions, company descriptions, and job start and end dates. The company description field may further contain information about the company (for example, the number of employees and the industry).
  • Information about the educational background, such as university, degrees, fields of education, start and finish dates.
  • IT skills, awards and publications.
  • Relocation ability.
  • Activities (likes, shares, shortlists).

Job

  • Required skills.
  • Nice to have skills.
  • Preferred location (onsite, work from home).
  • Company preferences.
 

Information extraction

   Collecting all this information presents another big challenge: most of it is embedded in plain text (e.g. resumes, job post descriptions). So we need to apply knowledge extraction techniques to those texts in order to get a complete view of the requirements and skills.
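As a rough illustration of the idea (not a production extractor), here is a minimal Python sketch that matches free text against a small skill vocabulary; the vocabulary and the resume snippet are made-up examples:

    # Minimal sketch of dictionary-based skill extraction from free text.
    # A real system would use a curated taxonomy and proper NLP
    # (tokenization, entity recognition, disambiguation).
    import re

    SKILL_VOCABULARY = {"java", "spring", "php", "cakephp", "sql", "hadoop"}

    def extract_skills(text: str) -> set[str]:
        """Return the vocabulary skills mentioned in a resume or job post."""
        tokens = re.findall(r"[a-z+#.]+", text.lower())
        return SKILL_VOCABULARY & set(tokens)

    resume = "5 years building REST services with Spring and SQL on Hadoop clusters."
    print(extract_skills(resume))  # {'spring', 'sql', 'hadoop'}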

 

Information enrichment

   A good matching technique requires looking beyond explicit information. For example, suppose a job post asks for a candidate with knowledge of the Java programming language, while a candidate claims knowledge of the Spring framework. If we only look for candidates with an explicitly stated Java skill, this candidate will not appear in the results, even though using the Spring framework implies Java knowledge. To solve this problem, we need to enrich both the job and the candidate information using a knowledge base that links these two skills, or at least knows that using the Spring framework implies a Java skill. This improves accuracy by matching on meanings and concepts instead of explicit information only.
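As a rough sketch of such enrichment, assuming a hand-made implication map rather than a real knowledge base or ontology:

    # Minimal enrichment sketch: the IMPLIES map is an illustrative assumption,
    # standing in for a real knowledge base or ontology.
    IMPLIES = {
        "spring": {"java"},
        "cakephp": {"php"},
        "pyspark": {"python", "spark"},
    }

    def enrich(skills: set[str]) -> set[str]:
        """Add every skill implied (transitively) by the explicit skills."""
        enriched = set(skills)
        frontier = list(skills)
        while frontier:
            skill = frontier.pop()
            for implied in IMPLIES.get(skill, ()):  # follow implication edges
                if implied not in enriched:
                    enriched.add(implied)
                    frontier.append(implied)
        return enriched

    print(enrich({"spring"}))   # {'spring', 'java'}
    print(enrich({"cakephp"}))  # {'cakephp', 'php'}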

 

Guidelines

Let’s define some guidelines we need to take care of when working on the matching.

  • Matching individuals to jobs depends on the skills and abilities those individuals should have.
  • Recommending people is a bidirectional process; it should take into account the preferences of both the recruiter and the candidate.
  • Recommendations should be based on the candidate's attributes, as well as the relational aspects that determine the fit between the person and the team members/company with whom the person will be collaborating (fit the candidate to the company, not only to the job).
  • Distinguish between must-have and nice-to-have requirements and weight their contributions dynamically.
  • Use an ontology as a knowledge base to categorize jobs.
  • Enrich job-seeker and job profiles with the knowledge base (knowing the CakePHP framework implies also knowing PHP).
  • Normalize data to prevent any single feature from dominating.
  • Learn from the job transitions of others.
 

Recommendation Techniques

Let's list some techniques used in the recommendation field. No single technique is suitable for all cases; you first need to match the technique to the type of data you have and to your overall use case.

  • Collaborative filtering
    • In this technique, we look for similar behavior between job-seekers, so we can find job-seekers who have similar interests and recommend jobs from their jobs of interest.
  • Content-based filtering
    • In this technique we look at the profile content of both the job-seeker and the job post and find the best match between them, regardless of the behavior of the job-seeker and of the company that posted the job (a minimal sketch follows this list).
  • Hybrid
    • Weighted: the recommendation score of an item is calculated from the results of all recommendation techniques available in the system.
    • Switching: the system uses some criteria to switch between recommendation techniques.
    • Mixed: several recommenders are applied simultaneously and their results are mixed.
    • Feature combination: collaborative information is used as additional feature data for each item, and content-based techniques are applied over this enriched data set.
    • Cascade: a staged process in which one technique first produces a rough ranking of candidates and a second technique refines the recommendation.
  • 3A Ranking: the algorithm maps jobs, companies and job-seekers to a graph with relations between them (apply, favorite, post, like, similar, match, visit, etc.), then relies on these relations and their ranking to recommend items.
    • Content-based similarity is used to calculate similarity between jobs, job-seekers and companies, and across these types (e.g. matching a job profile to a job-seeker profile).
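As a minimal illustration of the content-based approach, the sketch below scores a candidate profile against job posts with TF-IDF and cosine similarity; it assumes scikit-learn is installed, and the texts are made-up examples:

    # Content-based matching sketch: TF-IDF vectors + cosine similarity.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    job_posts = [
        "Java backend engineer, Spring, SQL, microservices",
        "PHP web developer, CakePHP, MySQL, HTML",
    ]
    candidate_profile = ["Built REST APIs with Java and Spring, strong SQL skills"]

    vectorizer = TfidfVectorizer()
    job_vectors = vectorizer.fit_transform(job_posts)           # one row per job
    candidate_vector = vectorizer.transform(candidate_profile)  # same vocabulary

    scores = cosine_similarity(candidate_vector, job_vectors)[0]
    best = scores.argmax()
    print(f"Best match: {job_posts[best]!r} (score={scores[best]:.2f})")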
 

General recommendation system Architecture



Figure 1 - General System architecture.

Evaluation

   To create a self-improving system you need feedback on the results you produce so you can correct them over time. The best feedback comes from the real world, so we can rely on job-seeker and company feedback to adjust the results as desired.

  • Explicit: Ask users to rate the recommendations (jobs / candidates)
  • Implicit: Track interactions with recommendations (applied, accepted, shortlisted, ignored); a small sketch follows this list.
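As a rough illustration, implicit interactions could be folded into a single feedback score per recommendation; the events and weights below are illustrative assumptions, not values from this post:

    # Illustrative sketch of turning implicit interactions into a feedback score.
    # The interaction weights are made-up example values.
    INTERACTION_WEIGHTS = {
        "applied": 1.0,
        "accepted": 1.0,
        "shortlisted": 0.7,
        "viewed": 0.2,
        "ignored": -0.3,
    }

    def feedback_score(events: list[str]) -> float:
        """Aggregate a job-seeker's interactions with one recommendation."""
        return sum(INTERACTION_WEIGHTS.get(event, 0.0) for event in events)

    print(feedback_score(["viewed", "shortlisted", "applied"]))  # 1.9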
 

Further Reading

  • Yao Lu, Sandy El Helou, Denis Gillet (2013). A Recommender System for Job Seeking and Recruiting Website. Proceedings of the 22nd International Conference on World Wide Web.
  • Wenxing Hong, Siting Zheng, Huan Wang (2013). A Job Recommender System Based on User Clustering. Journal of Computers, Vol. 8.
  • Shaha T. Al-Otaibi, Mourad Ykhlef (July 2012). A survey of job recommender systems. International Journal of the Physical Sciences, Vol. 7(29).
  • Berkant Cambazoglu, Aristides Gionis (2011). Machine learned job recommendation. Proceedings of the Fifth ACM Conference on Recommender Systems.

It's our pleasure to highlight the initiative taken by our data team leader Ahmed Mahran to contribute effectively to the Spark Time Series project, created by Sandy Ryza, a senior data scientist at Cloudera, the leading big data solutions provider.

 

Time series data has gained increasing attention in the past few years. To quote Sandy Ryza:

 

Time-series analysis is becoming mainstream across multiple data-rich industries. The new Spark-TS library helps analysts and data scientists focus on business questions, not on building their own algorithms.

 

Find the full story here, where he introduces Spark-TS and credits our contributor.

 

We are forever indebted to the open source community; it has enabled us to create wonderful feats. It is our deep belief that we should give back to the community in order to guarantee its health and sustainability. We are proud to have contributed to such a great project and we look forward to more.

   
    In the search space, pagination always has to happen. Solr offers basic paging: you simply specify the start and rows parameters, where start indicates where the returned results should start and rows specifies how many documents are returned. With basic paging, partial index exporting and migration become a problem. Since basic paging needs to sort all the results before returning the desired subset, it requires a large amount of memory when start is large. For instance, start=1000000 and rows=10 causes an inefficient memory allocation because 1,000,010 documents have to be sorted. In a distributed environment the case is worse, because the engine has to fetch the top 1,000,010 documents from each shard, sort them, and then return the result set.
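To make the cost concrete, here is a hedged Python sketch of basic paging against a Solr select handler using the requests library; the Solr URL and collection name are placeholders:

    # Basic (start/rows) paging against Solr. Deep pages force Solr to sort
    # start + rows documents before returning just `rows` of them.
    import requests

    SOLR_SELECT = "http://localhost:8983/solr/jobs/select"  # placeholder URL

    def fetch_page(page: int, rows: int = 10):
        params = {
            "q": "*:*",
            "sort": "id asc",
            "start": page * rows,  # grows without bound as we page deeper
            "rows": rows,
            "wt": "json",
        }
        response = requests.get(SOLR_SELECT, params=params)
        response.raise_for_status()
        return response.json()["response"]["docs"]

    # Page 100000 makes Solr sort the top 1,000,010 documents (per shard in a
    # distributed setup) just to return 10 of them.
    docs = fetch_page(page=100_000)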

PhoneGap is an open source framework for building cross-platform mobile apps using basic web skills like HTML5, JavaScript and CSS. It is all about wrapping your app with PhoneGap for fast deployment to different platforms. I started developing applications with it in April 2012. At the beginning I was really impressed, since I could use all my web skills to develop apps, but after only 6 months I began to be shocked by performance and UI issues, such as choppy animations, among others.

Shared storage and HDFS (pros and cons)


Figure 1 - Shared storage in a distributed compute cluster.
 

HDFS was built to scale out using disks directly attached to commodity servers so as to form a basis for a distributed processing framework that exploits the data locality principle: moving code to data is much cheaper than the other way around. Hence, incorporating a shared storage system, as opposed to directly attached storage (DAS), in order to host HDFS is not straightforward, as centralized storage is not in the nature of HDFS.

 

Moreover, shared storage systems have some potential limitations. As a shared service, such a system needs to efficiently handle multiple concurrent IO requests from the numerous clients of the compute cluster. Also, the network is the channel for communicating data to and from the shared storage, which can consume much of the network's bandwidth. While in a DAS model the network is also used for sharing data, data-local computations reduce network bandwidth consumption significantly. Furthermore, compared with the network, the regular internal per-machine data buses are faster since, besides the underlying hardware technology, they do not incur the overhead imposed by the networking stack and its protocols (layered protocol headers, checksum computations, …).

 

On the other hand, shared storage systems have unique attractive features that are worth considering, either when constructing a new distributed data processing cluster or when there is a need to exploit an existing infrastructure with a pre-installed shared storage system. One important feature is the separation of compute from storage resources. This allows independent administration and scalability: it becomes possible to scale only one resource type, either compute or storage, without unnecessarily scaling the other. In addition, compute resources are almost fully utilized for computations rather than bearing the overhead of serving shared data to other consumers, which ensures efficiency and high availability for data sharing.

 

Moreover, shared storage systems provide application-agnostic features that offload further overhead from compute resources while adding strengths to the cluster: security, backups and snapshots, efficient capacity utilization via compression, enhanced IO performance by leveraging the latest storage and network technologies, and failure handling and high availability. With a shared storage system, the namenode is no longer a single point of failure, as the metadata can be backed up safely without needing the extra namenodes required by the Hadoop HA (High Availability) feature. Besides, data replication, as an inherent feature of shared storage systems, relieves HDFS daemons of the block replication burden, saving CPU, memory and network bandwidth, since fewer block replicas simplify write pipelines and downsize block pools.

 

How HDFS works with shared storage

For HDFS, we can categorize shared storage systems into two main groups: HDFS-aware and HDFS-agnostic. HDFS-aware storage systems are those that provide native support for HDFS, sometimes to the extent that the HDFS daemons run inside the storage system itself. In that case, HDFS is readily tuned for the shared storage system and, moreover, the rest of the cluster resources become pure compute resources. HDFS-agnostic storage systems, on the other hand, know nothing about HDFS, so incorporating them requires much more involvement.

 

For an HDFS-agnostic shared storage system, aside from the recommendations of the storage system provider, a typical deployment would be to allocate separate storage partitions for each host in the cluster and to distribute the HDFS daemons such that one or two master hosts run the namenode and secondary namenode daemons and each of the remaining slave hosts runs at least one datanode daemon. Given such a configuration, let's inspect the cost of read and write operations.

 

Figure 2 - Data flow while Host 2 performs a write with replication factor of 2.
 

Figure 2 illustrates the data flow if a client on Host 2 is writing two replicas, one local on Host 2 and another remote on Host 1. The following are the paths for the data transmission:

  1. The client on Host 2 transmits the data to the colocal data node over kernel network.
  2. The data node on Host 2 transmits the data to the shared storage over network.
  3. The data node on Host 2 transmits the data to the data node on Host 1 over network.
  4. The data node on Host 1 transmits the data to the shared storage over network.

 

The same data is transmitted over the network three times and is written to the same storage twice. Hence, write operations are very expensive in terms of network bandwidth and storage capacity consumption. It makes little sense to use block replication on shared storage, especially if the shared storage system already provides data replication.

 

Figure 3 - Data flow while either hosts performs a read.
 

In Figure 3, if a client on Host 1 is reading a block maintained by the colocal data node, the following are the possible data transmission paths:

  1. The data is transmitted over network from the shared storage to the data node on Host 1 (link C).
  2. The data node on Host 1 transmits the data to the colocal client (link A) over kernel network.

 

However, if a client on Host 2 reads a block only maintained by the remote data node on Host 1, possible paths are:

  1. The data is transmitted over network from the shared storage to the data node on Host 1 (link C).
  2. The data node on Host 1 transmits the data over network to the remote client on Host 2 (link B).

 

In a remote read, the same data is transmitted over the network twice, while in a local read it is transmitted only once. Hence, remote reads are more expensive than local reads in terms of network bandwidth consumption. The following table summarizes the cost of data transmission in the different scenarios discussed above.

 

Data transmission paths (cost) per operation:

  Operation                                                 Network   Kernel Network
  Local write (one replica)                                 1         1
  Remote write (one local replica and one remote replica)   3         1
  Local read                                                1         1
  Remote read                                               2         0

 

Increasing block replication increases data locality and read performance, but it decreases write performance and wastes storage. It would be great if there were a way to shorten the data paths by taking shortcuts.

 

Short-circuit reads


Figure 4 - Short-circuiting local and remote reads from a shared storage.
 

Short-circuit local reads

 

HDFS provides one shortcut for local reads: if the client is reading a file from a colocal datanode, the client can read the file directly from disk, bypassing the datanode. In Figure 4, a client on Host 1 requests a read from the colocal datanode. The datanode redirects the client to the required block in the filesystem, and the client reads the file directly from the shared storage over the network.

 

Short-circuit remote reads

By analogy with the short-circuit local read, a short-circuit remote read occurs when a client requests a read from a remote datanode and the remote datanode gives the client direct access to the file. This is only a valid scenario if the data is stored on a shared storage system to which all hosts have direct access. However, HDFS is not aware of whether it is hosted on shared storage or not. Hence, HDFS does not provide a shortcut for remote reads, but it is still possible to hack around it.

 

The HDFS client knows whether it is colocal with a datanode by checking the datanode's IP address against the list of network interfaces on the same host as the client. If the datanode IP is one of them, the client knows that the datanode is colocal. Then, there are two possible ways for the client to get the file: the legacy short-circuit read or the new short-circuit read. In the legacy implementation, the client requests, through TCP, the absolute path of the block from the datanode; after receiving the path, the client opens the block file and reads the content directly. In the new implementation, the client connects to the datanode through a Unix domain socket; the datanode opens the file and passes the file descriptor to the client over the Unix domain socket, and the client then reads the file directly through the descriptor.
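To illustrate the mechanism the new implementation relies on (this is not HDFS code), the following Python sketch passes an open file descriptor between two processes over a Unix domain socket; it assumes Python 3.9+ on a Unix system and uses /etc/hostname as a stand-in for a block file:

    # File-descriptor passing over a Unix domain socket (SCM_RIGHTS),
    # the same primitive the new short-circuit read implementation uses.
    import os
    import socket

    parent_sock, child_sock = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)

    if os.fork() == 0:
        # "Datanode" side: open the block file and hand its descriptor over.
        fd = os.open("/etc/hostname", os.O_RDONLY)  # stand-in for a block file
        socket.send_fds(child_sock, [b"block-0001"], [fd])
        os._exit(0)

    # "Client" side: receive the descriptor and read the file directly.
    msg, fds, _, _ = socket.recv_fds(parent_sock, 1024, 1)
    with os.fdopen(fds[0], "rb") as block:
        print(msg, block.read())
    os.wait()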

 

HDFS short-circuit read mainly involves the following three actions:

  1. Datanode locality check
  2. Interprocess communication through Unix domain socket (new implementation only)
  3. File direct access

 

Datanode locality check: It is possible to get around this check by adding, to each host, one dummy network interface for each remote datanode. Each dummy interface is assigned the same IP address as the corresponding remote datanode. It is important to hide these interfaces from the network so that the real datanodes remain reachable.
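For illustration only, the locality check can be approximated in Python as follows; this sketch assumes the psutil package, whereas HDFS performs the equivalent check inside its Java DFS client:

    # The client is "colocal" with a datanode if the datanode's IP matches one
    # of the addresses configured on the client's own network interfaces.
    import socket
    import psutil

    def is_colocal(datanode_ip: str) -> bool:
        for addrs in psutil.net_if_addrs().values():
            for addr in addrs:
                if addr.family == socket.AF_INET and addr.address == datanode_ip:
                    return True
        return False

    # With a dummy interface carrying a remote datanode's IP (see the appendix),
    # this check also succeeds for that remote datanode.
    print(is_colocal("1.1.1.1"))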

 

Interprocess communication through Unix domain socket: A Unix domain socket works at the kernel level and connects parties on the same host only; processes meet at local files but share data through memory. To the best of our knowledge, it is possible to forward a Unix domain socket between two hosts, enabling two remote processes to communicate through it; OpenSSH, for example, can forward Unix domain sockets. However, Hadoop provides only one configuration parameter for the socket address (dfs.domain.socket.path), so the datanode and the DFS client on the same node use the same socket path, which makes it difficult to forward the DFS client's connection to the appropriate remote datanode. Hence, it is not straightforward to hack the new short-circuit implementation into short-circuiting remote reads.

 

File direct access: Clients can have direct access to files maintained by a remote datanode if the files are hosted on a shared file system to which all hosts have access. Each datanode should be assigned an independent data directory that is hosted on a shared file system and has the same mount point on all hosts. For example, datanode 1 on Host 1 is assigned the data directory /hdfs/data1/ and datanode 2 on Host 2 is assigned the data directory /hdfs/data2/. Both directories are hosted by a shared file system (say NFS) and are mounted on each host at the mount points /hdfs/data1/ and /hdfs/data2/, respectively.

 

Fortunately, we can hack around the legacy implementation of short-circuit local reads, as it does not involve any Unix domain socket communication. The following table summarizes the cost of data transmission in the different read scenarios after enabling short-circuit remote reads. Unfortunately, HDFS does not provide a shortcut for writes, so the write cost stays the same.

 

Data transmission paths (cost) per operation, with short-circuit remote reads enabled:

  Operation     Network   Kernel Network
  Local read    1         0
  Remote read   1         0

 

Now, local and remote reads have the same reduced cost. This is a substantial improvement to HDFS for reading from a shared storage.

 

Proof of concept

Experiment Setup


Figure 5 - Short-circuit remote read experiment setup (SCRR is a shorthand for Short Circuit Remote Read).
 

As illustrated in Figure 5, we build a small cluster of three machines with the following configuration. SCRR3 is the master node hosting an NFS server that simulates the shared storage and exposes two locations (/home/host/hdfs1 and /home/host/hdfs2) as the storage directories for the cluster datanodes. It also hosts the cluster namenode and secondary namenode. SCRR2 and SCRR1 are both slave nodes, each hosting a datanode. The datanode on SCRR1 is configured with a storage directory that is hosted on the shared storage and mounted at /home/host/hdfs1. The datanode on SCRR2 likewise has a storage directory hosted on the shared storage but mounted at /home/host/hdfs2. Both storage directories are mounted on both nodes (SCRR1 and SCRR2) at the same mount points, so that a datanode's directory can be accessed on both nodes through the same path.

 

In our experiment, we assign each node a specific role: SCRR3 has the shared storage role, SCRR2 has the DFS client role and SCRR1 has the datanode role. We configure the HDFS cluster with a minimum replication factor of one and enable the legacy short-circuit local read. We shut down the datanode on SCRR2 and configure a dummy network interface with the same address as the datanode on SCRR1. We upload a file to the HDFS cluster (which is then kept by the datanode on SCRR1). We then run a simple DFS client on node SCRR2 that reads the file byte by byte, counting the number of bytes read. After the DFS client finishes, we disable the short-circuit remote read configuration by deleting the dummy network interface and re-run the same DFS client. We repeatedly run the same DFS client, alternating between the two modes (short circuit enabled and short circuit disabled), while watching the network bandwidth consumption.
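For reference, a rough Python stand-in for such a byte-by-byte client is sketched below; it assumes pyarrow with libhdfs is available, and the host, port and file path are placeholders (the client used in the experiment was presumably written against the Java DFS client API):

    # Reads an HDFS file one byte at a time and counts the bytes read.
    import pyarrow.fs as pafs

    hdfs = pafs.HadoopFileSystem(host="scrr3", port=8020)  # placeholder endpoint
    count = 0
    with hdfs.open_input_stream("/scrr/testfile.bin") as stream:
        while stream.read(1):  # one byte per call, like the original client
            count += 1
    print(f"read {count} bytes")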

 

Results


Figure 6 - Network bandwidth consumption on each node.
 

Figure 6 shows the network bandwidth consumption on each node while performing the experiment. Short circuit is enabled in runs B, D and F and disabled in runs A, C, E and G. The following table summarizes the network bandwidth consumption on each node in terms of data transmission and reception rates, as well as what data is being communicated. It is clear how short-circuiting remote reads from the shared storage significantly reduces network bandwidth consumption, not to mention application running time.

 

Run A (short circuit disabled):
  Datanode:       Receive: High (data from shared storage); Transmit: High (data to DFS client)
  DFS Client:     Receive: High (data from datanode); Transmit: Low (acks …)
  Shared Storage: Receive: Low (acks …); Transmit: High (data to datanode)

Run B (short circuit enabled):
  Datanode:       Receive: Idle (short circuit); Transmit: Idle (short circuit)
  DFS Client:     Receive: High (data from shared storage); Transmit: Low (acks …)
  Shared Storage: Receive: Low (acks …); Transmit: High (data to DFS client)

Runs C, E, G (short circuit disabled):
  Datanode:       Receive: Low (acks …, data in OS cache); Transmit: High (data to DFS client)
  DFS Client:     Receive: High (data from datanode); Transmit: Low (acks …)
  Shared Storage: Receive: Idle (data cached on the datanode side); Transmit: Idle (data cached on the datanode side)

Runs D, F (short circuit enabled):
  Datanode:       Receive: Idle (short circuit); Transmit: Idle (short circuit)
  DFS Client:     Receive: Idle (data in OS cache); Transmit: Idle (data in OS cache)
  Shared Storage: Receive: Idle (data cached on the DFS client side); Transmit: Idle (data cached on the DFS client side)
 

Appendix: Dummy network interface configuration on Linux:

Assume a cluster of three slaves

 

  Host    IP Address   MAC Address
  Host1   1.1.1.1      00:00:00:00:00:01
  Host2   1.1.1.2      00:00:00:00:00:02
  Host3   1.1.1.3      00:00:00:00:00:03

  
  1. On each host, create a dummy network interface and assign it a suitable name, one for each remote datanode. For example, on Host1:
    sudo ip link add dummyDataNode2 type dummy
    sudo ip link add dummyDataNode3 type dummy
  2. On each host, assign each dummy device the same address as the corresponding real datanode. For example, on Host 3:
    sudo ifconfig dummyDataNode1 1.1.1.1
    sudo ifconfig dummyDataNode2 1.1.1.2
  3. Make sure that all machines have the MAC addresses of the real datanodes. For example, on Host 1, assuming eth0 is the interface through which Host 1 is supposed to communicate with the HDFS cluster network:
    sudo arp -s 1.1.1.2 00:00:00:00:00:02 -i eth0
    sudo arp -s 1.1.1.3 00:00:00:00:00:03 -i eth0
  4. Make sure that all machines have a route to the real datanodes through the correct interface. For example, on Host 2:
    sudo route add -host 1.1.1.1 dev eth0
    sudo route add -host 1.1.1.3 dev eth0
  5. On each host, hide the dummy interface from the network by disabling ARP for the dummy device and deleting all routing entries that involve the dummy device. Use both the route and ip-route commands to query for the routing entries involving the dummy device. For example, on Host 1:
Switch off ARP on the devices so that they neither send nor respond to ARP requests:
	sudo ifconfig dummyDataNode2 -arp
	sudo ifconfig dummyDataNode3 -arp

  Query routes with route command:
	route

	Output:
	Kernel IP routing table
	Destination  Gateway    Genmask         Flags Metric Ref    Use Iface
	default      1.1.1.254  0.0.0.0         UG    0      0        0 eth0
	1.1.1.0      *          255.255.255.0   U     1      0        0 eth0
	1.1.1.0      *          255.255.255.0   U     1      0        0 dummyDataNode2
	1.1.1.0      *          255.255.255.0   U     1      0        0 dummyDataNode3

  Delete the routing entries associated with the dummy interfaces:
	sudo route del -net 1.1.1.0 netmask 255.255.255.0 dev dummyDataNode2
	sudo route del -net 1.1.1.0 netmask 255.255.255.0 dev dummyDataNode3

  Query routes with ip-route command for each dummy interface:
	ip route show table local dev dummyDataNode2

	Output:
	broadcast 1.1.1.0  proto kernel  scope link  src 1.1.1.2 
	local 1.1.1.2  proto kernel  scope host  src 1.1.1.2
	broadcast 1.1.1.255  proto kernel  scope link  src 1.1.1.2
	ip route show table local dev dummyDataNode3

	Output:
	broadcast 1.1.1.0  proto kernel  scope link  src 1.1.1.3
	local 1.1.1.3  proto kernel  scope host  src 1.1.1.3
	broadcast 1.1.1.255  proto kernel  scope link  src 1.1.1.3

  Delete the routing entries associated with the dummy interfaces:
	sudo ip route del table local dev dummyDataNode2 to local 1.1.1.2
	sudo ip route del table local dev dummyDataNode2 to broadcast 1.1.1.0
	sudo ip route del table local dev dummyDataNode2 to broadcast 1.1.1.255
	sudo ip route del table local dev dummyDataNode3 to local 1.1.1.3
	sudo ip route del table local dev dummyDataNode3 to broadcast 1.1.1.0
	sudo ip route del table local dev dummyDataNode3 to broadcast 1.1.1.255



 
