SoC: AtomPub Week 5 Status

Another week down, another step closer to a working AtomPub importer. Unfortunately, week 5 went anything but according to plan. Sure, I fixed the bugs found at the end of last week, but new issues came to light, requiring changes in the week’s plan.

New Issues Found

Additional testing early in the week by my mentor Lloyd brought forth some coding challenges. First off, Lloyd found a few error messages on import. Those were quickly resolved, but once Lloyd made it past the error messages, he found the performance of the importing to be subpar.

After I added some performance measurements to the importer, the source of the problem became clear: the importer makes several separate requests per post, and they add up. For every post, the importer needs to ping the post URL to check for a 404 (draft status), request the comments feed, and request the trackbacks XML-RPC data. Each post was taking over a second, which makes a large import very slow.
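The kind of per-step measurement described above can be sketched as follows. This is an illustration in Python rather than the importer's actual PHP, and the step names and the `fake_step` stand-ins are hypothetical: they sleep instead of making real network requests.

```python
import time
from collections import defaultdict

def timed(label, timings, func, *args):
    """Run func(*args) and add its wall-clock duration to timings[label]."""
    start = time.perf_counter()
    try:
        return func(*args)
    finally:
        timings[label] += time.perf_counter() - start

def import_post(post_url, steps, timings):
    """Run every per-post step in order, recording how long each one takes."""
    return {label: timed(label, timings, step, post_url)
            for label, step in steps.items()}

def fake_step(delay):
    """Hypothetical stand-in for a network request: sleeps instead of fetching."""
    def step(post_url):
        time.sleep(delay)
        return post_url
    return step

# The three per-post requests named in the text, with made-up delays.
steps = {
    "draft_check": fake_step(0.01),
    "comments_feed": fake_step(0.02),
    "trackbacks": fake_step(0.01),
}

timings = defaultdict(float)
for n in range(3):
    import_post(f"http://example.com/post/{n}", steps, timings)

# Report the slowest step first.
for label, seconds in sorted(timings.items(), key=lambda item: -item[1]):
    print(f"{label}: {seconds:.3f}s total")
```

Summing the time per step, rather than per post, makes it easier to see which of the three requests dominates the import.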

Progress, Progress, Progress

Unfortunately, nothing can be done at this time to lower the request time; the feed requests are at the mercy of the internet. However, the notifications can be enhanced so a user is not wondering why the importer has not finished.

So, after discussing the issue with Lloyd, I think a progress bar is needed in this situation. Unfortunately, due to the nature of PHP applications, I can’t just add a progress bar out of nowhere. I will need to modularize the importer into a more AJAXy interface, so an AJAX progress bar can be updated with the import status. I will begin looking at solutions for this later in the upcoming week.

Even More Issues

The performance issue was not my only problem this week. Lloyd found that when importing from a blog with 3,000 entries, the importer ran out of memory. Surprisingly, it ran out of memory at around 130MB, which is far beyond what a normal web server would allow, given that PHP is typically limited to 16MB of RAM.

Once this issue was pointed out to me, I quickly found the problem. I had been putting all entries in a massive array before looping through them to import. So, to correct this I limited the importer to batches of 20 posts at a time, freeing the memory between each set of posts. This appears to have corrected the problem.
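The batching idea can be sketched roughly like this. It is written in Python rather than the importer's PHP, and `fetch_page`, `import_entry`, and the fake 45-entry source are hypothetical names used only for illustration.

```python
def fetch_batches(fetch_page, batch_size=20):
    """Yield lists of at most batch_size entries until the source is exhausted."""
    offset = 0
    while True:
        batch = fetch_page(offset, batch_size)
        if not batch:
            return
        yield batch
        offset += len(batch)

def import_all(fetch_page, import_entry, batch_size=20):
    """Import entries one batch at a time instead of loading them all first."""
    imported = 0
    for batch in fetch_batches(fetch_page, batch_size):
        for entry in batch:
            import_entry(entry)
            imported += 1
        # The previous batch is no longer referenced once the loop moves on,
        # so only one batch needs to be held in memory at a time.
    return imported

# Hypothetical source blog with 45 entries, served in pages.
entries = [f"entry-{n}" for n in range(45)]

def fake_fetch_page(offset, limit):
    return entries[offset:offset + limit]

seen = []
count = import_all(fake_fetch_page, seen.append)
print(count, "entries imported")
```

With this shape, peak memory depends on the batch size rather than on the total number of entries in the blog.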

In addition to the memory problem, I found out that the comments feed has the same 20-comments-at-a-time restriction that the main feed had. Already familiar with the issue from the main feed, I applied the same fix and all comments started to be imported.

Outlook Looks Good

Despite the large number of issues discovered this week, I think the future of the importer is looking better than ever. Some major hurdles were overcome, and because of that, this week ends with a more memory-efficient, error-free version of the importer.

With the new discoveries, obviously the plan has been changed a bit. Currently, I’m looking at finally (and yes, I mean finally) writing the code to automatically detect the Atom API feed at the beginning of next week. From there, I will begin working on updating the interface to be more AJAXy, providing notifications along the way.

4 Comments

  1. Joseph Scott on Jun 27, 2008 at 7:52 pm:

    Might be worth queuing up the XML-RPC requests and then using a multicall request http://scripts.incutio.com/xmlrpc/advanced-client-construction.php to reduce the total number of requests you need to make.

  2. Ronald Heft on Jun 27, 2008 at 9:13 pm:

    That’s a great idea. I did some calculations and a multi-call would save about 6 seconds per 20 posts, so it’s definitely a worthwhile implementation.

    Outside of XML-RPC, I just reduced the 404 check time by performing only a HEAD request, so that adds a savings of about 5 seconds per 20 posts.

    So, if I can get the multicall request working, a set of 20 posts would take roughly 10 seconds, depending on how many comments there are. Definitely better than 20 seconds per 20 posts.

  3. Richard Hertz on Jul 1, 2008 at 12:38 pm:

    Is there a reason why you are using an array? They are terribly inefficient. I believe PHP supports objects, right? Can’t you create a custom data type by using a class, similar as one would do in C++?

  4. Ronald Heft on Jul 1, 2008 at 12:51 pm:

    I’m using arrays because WordPress still needs to support PHP4, and arrays are actually faster under PHP4. Objects are more like an associative array under PHP4, and carry extra processing time. Sure, the game changes under PHP5, but for now PHP4 is needed for compatibility.
