Part 1: A Guide to Data Migration from HTML Pages into Drupal 7

I was recently faced with a project where the client had a "database" of items to be brought into the brand new Drupal 7 site that we were preparing for him. As it turned out, this "database" was actually about 2,500 HTML documents. To extract the data from these HTML docs, I needed 3 things:

  • Tidy to clean up the HTML
  • QueryPath to extract the text from the HTML
  • Some custom PHP to bring the records into Drupal

Tidy up

I needed to use Tidy because the HTML was a bit inconsistent, and some tags were not closed properly, which made things problematic for QueryPath. Tidy works great, and it was really easy to use. There are many options in the manual, but I chose to use the -clean, -indent and -modify options. Let's take a look at what happens with these options:

  • clean: This option replaces presentational markup, such as FONT, NOBR and CENTER tags and inline style attributes, with classes created by Tidy. Tidy adds a <style> section to the head of the document containing the corresponding CSS rules.
  • indent: This option adds indentation to the HTML, which helps immensely with visual inspection and makes the code easier to read.
  • modify: This option tells Tidy to modify the original source file, rather than creating a new copy. I would recommend making a backup copy of your files before using the -modify option.

Here's a sample of the HTML code that I was dealing with:

<html>
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
<html><head><title>Ad Trust (Affect)</title>
<body>
<p class="key_list">keyword1, keyword2, keyword3, keyword4, keyword5</p>
<p class="Name">NAME: Test Record #1</p>
<p class="Description">DESCRIPTION:</p>
<p>This is the description of the html item
that we were
given
to import into our Drupal 7 site.</p>
<p class="Origin">ORIGIN:</p>
<p>The origin of this information is
is the client. He has done massive amounts
of research about <span style="font-style: italic;">pertinent information on these matters</span>.</p>
<p>This is a second paragraph under the origin item.</p>
<p class="Reliability">RELIABILITY:</p>
<p>The reliability of this information is rated <span style="font-weight: bold;">high</span> since it came directly from our client</span>
</body></html>
</html>

So a simple command in terminal like this did the cleanup for me:

tidy -cim test.html

Here’s the cleaned up output:

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN"
"http://www.w3.org/TR/html4/strict.dtd">
 
<html>
<head>
  <meta name="generator" content=
  "HTML Tidy for Mac OS X (vers 25 March 2009), see www.w3.org">
 
  <title>Ad Trust (Affect)</title>
  <style type="text/css">
  span.c2 {font-weight: bold;}
  span.c1 {font-style: italic;}
  </style>
</head>
 
<body>
  <p class="key_list"> keyword1, keyword2, keyword3, keyword4, keyword5</p>
 
  <p class="Name">NAME: Test Record #1</p>
 
  <p class="Description">DESCRIPTION:</p>
 
  <p>This is the description of the html item that we were given to
  import into our Drupal 7 site.</p>
 
  <p class="Origin">ORIGIN:</p>
 
  <p>The origin of this information is is the client. He has done
  massive amounts of research about <span class="c1">pertinent
  information on these matters</span>.</p>
 
  <p>This is a second paragraph under the origin item.</p>
 
  <p class="Reliability">RELIABILITY:</p>
 
  <p>The reliability of this information is rated <span class=
  "c2">high</span> since it came directly from our client</p>
</body>
</html>

You can now see that the inline CSS has been replaced with classes, the markup has been indented, and the paragraph ending "from our client" now has the proper closing tag of </p> rather than the original </span>.
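
Since -modify overwrites each file, the backup recommended above can be folded into the same step when working through all 2,500 documents. Here is a minimal sketch, assuming the tidy binary is on the PATH. The helper name tidy_in_place() and the .bak suffix are hypothetical choices, not part of the original workflow.

```php
<?php
// Sketch only: tidy_in_place() is a hypothetical helper, and the .bak suffix
// is an arbitrary choice. Assumes the `tidy` binary is on the PATH.
function tidy_in_place($file) {
  // Keep an untouched copy, because -m overwrites the original file.
  if (!copy($file, $file . '.bak')) {
    return FALSE;
  }
  $output = array();
  $status = 0;
  // Tidy writes its diagnostics to stderr, so capture them with the output.
  exec('tidy -cim ' . escapeshellarg($file) . ' 2>&1', $output, $status);
  // Tidy exits with 0 on success, 1 if there were warnings, 2 on errors.
  // Any other status (for example, tidy not being installed) also fails.
  return $status < 2;
}

// Demonstration on a throwaway file with a mismatched closing tag.
$file = tempnam(sys_get_temp_dir(), 'tidy');
file_put_contents($file, '<p class="Name">NAME: Test Record #1</span>');
var_dump(tidy_in_place($file));
```

To process a whole directory, the helper could be called for each path returned by glob('somedir/*.html').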

QueryPath to the Rescue

Now that the HTML is cleaned up and ready to use, we can move on to using QueryPath to extract the data. QueryPath works much like jQuery, but it runs on the server, so it lets you target HTML classes and ids, and their contents, from PHP. I created a subdirectory in my Drupal root called qp, then downloaded the QueryPath library and extracted it there. The first thing we need to do is include the QueryPath library:

require_once('src/QueryPath/QueryPath.php');

Now, let’s create a QueryPath object using the html file we wish to extract data from:    

$qp = htmlqp('test.html', 'body');

The second argument in the example is 'body'. That tells QueryPath to put its search pointer inside the <body> of the document. Now I want to set up an array to hold the elements that we're extracting. This will help later on when we get into QueryPath's callback functions as well.

$data = array();

Now that we've got our QueryPath object and $data array created, we can move the pointer around, and return elements of the html file. Let's start with the keywords:

$data['keywords'] = $qp->find(':root .key_list')->text();

The find function tells the QueryPath pointer to move to the first instance of the class key_list. I added the :root selector before .key_list so that QueryPath would start at the beginning of the $qp object, which is the beginning of the <body>. The ->text() function tells QueryPath that we want the plain text from inside the key_list element. Now we can move to the next item we want to extract, the name.
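
The article keeps the keywords as one comma-separated string. If they are needed as separate values later on (an assumption, nothing in this part requires it), the string can be split with plain PHP. This is a minimal sketch, and split_keywords() is a hypothetical helper name.

```php
<?php
// Sketch only: split_keywords() is a hypothetical helper name.
function split_keywords($text) {
  // Tidy may leave leading or trailing whitespace around each keyword.
  $keywords = array_map('trim', explode(',', $text));
  // Drop empty entries (for example from a trailing comma) and reindex.
  return array_values(array_filter($keywords, 'strlen'));
}

print_r(split_keywords(' keyword1, keyword2, keyword3, keyword4, keyword5'));
```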

$data['name'] = str_replace('NAME: ', '', $qp->find(':root .Name')->text());

This is a similar operation to the keywords, but we've added a str_replace function into the mix to remove the 'NAME: '. Once again, we tell QueryPath to start at the beginning of the object, with :root, and then put its pointer on the first element with the class of Name. We use ->text() to get the plain text contents of that element. Easy enough, huh?
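
One caveat: str_replace() removes every occurrence of 'NAME: ' and leaves the text untouched if the spacing after the colon differs. A slightly more forgiving variant strips the label only at the start of the string. This is a sketch, and strip_label() is a hypothetical helper name, not part of QueryPath.

```php
<?php
// Sketch only: strip_label() is a hypothetical helper name.
function strip_label($label, $text) {
  // Match the label only at the start, with any amount of whitespace
  // before the label and after the colon.
  $pattern = '/^\s*' . preg_quote($label, '/') . ':\s*/';
  return trim(preg_replace($pattern, '', $text));
}

echo strip_label('NAME', 'NAME: Test Record #1'), "\n";
```

In the article's code this would take the place of the str_replace() call, with the result of ->text() passed as the second argument.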

The next 3 items that we are going to extract are a little trickier. For keywords and name, we could always count on there being only one item with the respective class on it, so we could just move the pointer and grab the text. For Description, Origin and Reliability, it's possible for there to be multiple HTML elements (paragraphs, lists, titles, etc.) after the title paragraph. To extract these we will use the ->nextAll() and ->each() functions:

$qp->find(':root .Description')->nextAll()->each('search_callback_description');

Here, we tell QueryPath to start at the beginning of the object, find the first element with the class of Description, then move to each element after. For each element the search_callback_description function is called. Here's the code for the search_callback_description function, then we'll step through what's going on:

function search_callback_description($index, $element) {
  global $data;
  $item = qp($element);
  if($item->hasClass('Origin')) {
    return FALSE;
  }
  $data['description'] .= $item->html();
  return TRUE;
}

We have 2 arguments, $index and $element, and the function needs to return TRUE or FALSE. Returning TRUE will continue to the next element and hit this callback again. Returning FALSE will terminate the cycle. The first thing we do is bring the $data array into scope with the global keyword so we can add to it. Next we create a new QueryPath object, $item, from the $element argument so we can inspect that element's class. If we hit the next title paragraph, .Origin in this case, we want to terminate the cycle by returning FALSE. To do this we use a simple if statement to check the class of the item: $item->hasClass('Origin'). If the element has the class of Origin, we terminate; otherwise we append the element's HTML to our $data array and return TRUE to continue to the next element. Another thing to note here is that we are using the ->html() function rather than the ->text() function. This makes sure that the HTML tags are included when we store the description.

To extract the Origin data, we will do almost exactly what we did for the Description, with a different callback function:

$qp->find(':root p.Origin')->nextAll()->each('search_callback_origin');

The search_callback_origin callback function only differs from the search_callback_description callback in the class that we are looking at to terminate the cycle.

function search_callback_origin($index, $element) {
  global $data;
  $item = qp($element);
  if($item->hasClass('Reliability')) {
    return FALSE;
  }
  $data['origin'] .= $item->html();
  return TRUE;
}

The final step is to get the Reliability data:

$qp->find(':root p.Reliability')->nextAll()->each('search_callback_reliability');

This time our callback is a bit simpler. Since we're at the end of the HTML document, the cycle will terminate on its own when the pointer reaches the end.

function search_callback_reliability($index, $element) {
  global $data;
  $item = qp($element);
  $data['reliability'] .= $item->html();
}
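
The three callbacks differ only in the class that ends the loop, and each one writes to a global. To illustrate the same stop-at-the-next-heading walk in one place, here is a minimal sketch that takes the start and stop classes as arguments. Note that it swaps in PHP's built-in DOM extension rather than QueryPath, so that it runs without the library. The helper names collect_section() and has_class() are hypothetical and not part of either API.

```php
<?php
// Sketch only: uses PHP's built-in DOM extension instead of QueryPath.
// collect_section() and has_class() are hypothetical helper names.

function has_class(DOMElement $el, $class) {
  $classes = preg_split('/\s+/', trim($el->getAttribute('class')));
  return in_array($class, $classes, TRUE);
}

// Returns the markup of every element that follows the first <p> with class
// $start, stopping at the first sibling that has class $stop. Pass NULL as
// $stop to read through to the last sibling.
function collect_section(DOMDocument $doc, $start, $stop = NULL) {
  $xpath = new DOMXPath($doc);
  $query = sprintf('//p[contains(concat(" ", normalize-space(@class), " "), " %s ")]', $start);
  $heading = $xpath->query($query)->item(0);
  if ($heading === NULL) {
    return '';
  }
  $html = '';
  for ($node = $heading->nextSibling; $node !== NULL; $node = $node->nextSibling) {
    if (!($node instanceof DOMElement)) {
      continue; // skip whitespace text nodes between elements
    }
    if ($stop !== NULL && has_class($node, $stop)) {
      break; // same role as returning FALSE from the each() callback
    }
    $html .= $doc->saveHTML($node);
  }
  return $html;
}

$doc = new DOMDocument();
$doc->loadHTML('<html><body>'
  . '<p class="Description">DESCRIPTION:</p><p>First.</p><p>Second.</p>'
  . '<p class="Origin">ORIGIN:</p><p>From the client.</p>'
  . '</body></html>');

echo collect_section($doc, 'Description', 'Origin'), "\n";
echo collect_section($doc, 'Origin'), "\n";
```

Passing the classes as arguments avoids repeating the callback for each section, at the cost of leaving QueryPath's selector syntax behind for this step.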

Put it all together, and this is what we have:

require_once 'src/QueryPath/QueryPath.php';
 
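$filepath = 'somedir';
$dir = opendir($filepath);

// loop through the files in the specified directory
while (($filename = readdir($dir)) !== FALSE) {

  // skip '.', '..' and anything else that is not an .html file
  if (pathinfo($filename, PATHINFO_EXTENSION) != 'html') {
    continue;
  }

  // reset $data for this file; the callbacks append with .=, so the
  // keys they use start out as empty strings
  $data = array('description' => '', 'origin' => '', 'reliability' => '');

  // create the querypath object; readdir() returns only the file name,
  // so prepend the directory
  $qp = htmlqp($filepath . '/' . $filename, 'body');

  // extract the html
  $data['keywords'] = $qp->find(':root .key_list')->text();
  $data['name'] = str_replace('NAME: ', '', $qp->find(':root .Name')->text());
  $qp->find(':root p.Description')->nextAll()->each('search_callback_description');
  $qp->find(':root p.Origin')->nextAll()->each('search_callback_origin');
  $qp->find(':root p.Reliability')->nextAll()->each('search_callback_reliability');

  // DO SOMETHING WITH $data HERE

}
closedir($dir);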
 
function search_callback_description($index, $element) {
  global $data;
  $item = qp($element);
  if($item->hasClass('Origin')) {
    return FALSE;
  }
  $data['description'] .= $item->html();
  return TRUE;
}
 
function search_callback_origin($index, $element) {
  global $data;
  $item = qp($element);
  if($item->hasClass('Reliability')) {
    return FALSE;
  }
  $data['origin'] .= $item->html();
  return TRUE;
}
 
function search_callback_reliability($index, $element) {
  global $data;
  $item = qp($element);
  $data['reliability'] .= $item->html();
}

Now that we've got our data loaded into an array, we just need to get it into Drupal. I'll cover that in the next part.